Search Results: "oku"

13 July 2022

Reproducible Builds: Reproducible Builds in June 2022

Welcome to the June 2022 report from the Reproducible Builds project. In these reports, we outline the most important things that we have been up to over the past month. As a quick recap, whilst anyone may inspect the source code of free software for malicious flaws, almost all software is distributed to end users as pre-compiled binaries.

Save the date! Despite several delays, we are pleased to announce dates for our in-person summit this year: November 1st 2022 November 3rd 2022
The event will happen in/around Venice (Italy), and we intend to pick a venue reachable via the train station and an international airport. However, the precise venue will depend on the number of attendees. Please see the announcement mail from Mattia Rizzolo, and do keep an eye on the mailing list for further announcements as it will hopefully include registration instructions.

News David Wheeler filed an issue against the Rust programming language to report that builds are not reproducible because full path to the source code is in the panic and debug strings . Luckily, as one of the responses mentions: the --remap-path-prefix solves this problem and has been used to great effect in build systems that rely on reproducibility (Bazel, Nix) to work at all and that there are efforts to teach cargo about it here .
The Python Security team announced that:
The ctx hosted project on PyPI was taken over via user account compromise and replaced with a malicious project which contained runtime code which collected the content of os.environ.items() when instantiating Ctx objects. The captured environment variables were sent as a base64 encoded query parameter to a Heroku application [ ]
As their announcement later goes onto state, version-pinning using hash-checking mode can prevent this attack, although this does depend on specific installations using this mode, rather than a prevention that can be applied systematically.
Developer vanitasvitae published an interesting and entertaining blog post detailing the blow-by-blow steps of debugging a reproducibility issue in PGPainless, a library which aims to make using OpenPGP in Java projects as simple as possible . Whilst their in-depth research into the internals of the .jar may have been unnecessary given that diffoscope would have identified the, it must be said that there is something to be said with occasionally delving into seemingly low-level details, as well describing any debugging process. Indeed, as vanitasvitae writes:
Yes, this would have spared me from 3h of debugging But I probably would also not have gone onto this little dive into the JAR/ZIP format, so in the end I m not mad.

Kees Cook published a short and practical blog post detailing how he uses reproducibility properties to aid work to replace one-element arrays in the Linux kernel. Kees approach is based on the principle that if a (small) proposed change is considered equivalent by the compiler, then the generated output will be identical but only if no other arbitrary or unrelated changes are introduced. Kees mentions the fantastic diffoscope tool, as well as various kernel-specific build options (eg. KBUILD_BUILD_TIMESTAMP) in order to prepare my build with the known to disrupt code layout options disabled .
Stefano Zacchiroli gave a presentation at GDR S curit Informatique based in part on a paper co-written with Chris Lamb titled Increasing the Integrity of Software Supply Chains. (Tweet)

Debian In Debian in this month, 28 reviews of Debian packages were added, 35 were updated and 27 were removed this month adding to our knowledge about identified issues. Two issue types were added: nondeterministic_checksum_generated_by_coq and nondetermistic_js_output_from_webpack. After Holger Levsen found hundreds of packages in the bookworm distribution that lack .buildinfo files, he uploaded 404 source packages to the archive (with no meaningful source changes). Currently bookworm now shows only 8 packages without .buildinfo files, and those 8 are fixed in unstable and should migrate shortly. By contrast, Debian unstable will always have packages without .buildinfo files, as this is how they come through the NEW queue. However, as these packages were not built on the official build servers (ie. they were uploaded by the maintainer) they will never migrate to Debian testing. In the future, therefore, testing should never have packages without .buildinfo files again. Roland Clobus posted yet another in-depth status report about his progress making the Debian Live images build reproducibly to our mailing list. In this update, Roland mentions that all major desktops build reproducibly with bullseye, bookworm and sid but also goes on to outline the progress made with automated testing of the generated images using openQA.

GNU Guix Vagrant Cascadian made a significant number of contributions to GNU Guix: Elsewhere in GNU Guix, Ludovic Court s published a paper in the journal The Art, Science, and Engineering of Programming called Building a Secure Software Supply Chain with GNU Guix:
This paper focuses on one research question: how can [Guix]((https://www.gnu.org/software/guix/) and similar systems allow users to securely update their software? [ ] Our main contribution is a model and tool to authenticate new Git revisions. We further show how, building on Git semantics, we build protections against downgrade attacks and related threats. We explain implementation choices. This work has been deployed in production two years ago, giving us insight on its actual use at scale every day. The Git checkout authentication at its core is applicable beyond the specific use case of Guix, and we think it could benefit to developer teams that use Git.
A full PDF of the text is available.

openSUSE In the world of openSUSE, SUSE announced at SUSECon that they are preparing to meet SLSA level 4. (SLSA (Supply chain Levels for Software Artifacts) is a new industry-led standardisation effort that aims to protect the integrity of the software supply chain.) However, at the time of writing, timestamps within RPM archives are not normalised, so bit-for-bit identical reproducible builds are not possible. Some in-toto provenance files published for SUSE s SLE-15-SP4 as one result of the SLSA level 4 effort. Old binaries are not rebuilt, so only new builds (e.g. maintenance updates) have this metadata added. Lastly, Bernhard M. Wiedemann posted his usual monthly openSUSE reproducible builds status report.

diffoscope diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb prepared and uploaded versions 215, 216 and 217 to Debian unstable. Chris Lamb also made the following changes:
  • New features:
    • Print profile output if we were called with --profile and we were killed via a TERM signal. This should help in situations where diffoscope is terminated due to some sort of timeout. [ ]
    • Support both PyPDF 1.x and 2.x. [ ]
  • Bug fixes:
    • Also catch IndexError exceptions (in addition to ValueError) when parsing .pyc files. (#1012258)
    • Correct the logic for supporting different versions of the argcomplete module. [ ]
  • Output improvements:
    • Don t leak the (likely-temporary) pathname when comparing PDF documents. [ ]
  • Logging improvements:
    • Update test fixtures for GNU readelf 2.38 (now in Debian unstable). [ ][ ]
    • Be more specific about the minimum required version of readelf (ie. binutils), as it appears that this patch level version change resulted in a change of output, not the minor version. [ ]
    • Use our @skip_unless_tool_is_at_least decorator (NB. at_least) over @skip_if_tool_version_is (NB. is) to fix tests under Debian stable. [ ]
    • Emit a warning if/when we are handling a UNIX TERM signal. [ ]
  • Codebase improvements:
    • Clarify in what situations the main finally block gets called with respect to TERM signal handling. [ ]
    • Clarify control flow in the diffoscope.profiling module. [ ]
    • Correctly package the scripts/ directory. [ ]
In addition, Edward Betts updated a broken link to the RSS on the diffoscope homepage and Vagrant Cascadian updated the diffoscope package in GNU Guix [ ][ ][ ].

Upstream patches The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including:

Testing framework The Reproducible Builds project runs a significant testing framework at tests.reproducible-builds.org, to check packages and other artifacts for reproducibility. This month, the following changes were made:
  • Holger Levsen:
    • Add a package set for packages that use the R programming language [ ] as well as one for Rust [ ].
    • Improve package set matching for Python [ ] and font-related [ ] packages.
    • Install the lz4, lzop and xz-utils packages on all nodes in order to detect running kernels. [ ]
    • Improve the cleanup mechanisms when testing the reproducibility of Debian Live images. [ ][ ]
    • In the automated node health checks, deprioritise the generic kernel warning . [ ]
  • Roland Clobus (Debian Live image reproducibility):
    • Add various maintenance jobs to the Jenkins view. [ ]
    • Cleanup old workspaces after 24 hours. [ ]
    • Cleanup temporary workspace and resulting directories. [ ]
    • Implement a number of fixes and improvements around publishing files. [ ][ ][ ]
    • Don t attempt to preserve the file timestamps when copying artifacts. [ ]
And finally, node maintenance was also performed by Mattia Rizzolo [ ].

Mailing list and website On our mailing list this month: Lastly, Chris Lamb updated the main Reproducible Builds website and documentation in a number of small ways, but primarily published an interview with Hans-Christoph Steiner of the F-Droid project. Chris Lamb also added a Coffeescript example for parsing and using the SOURCE_DATE_EPOCH environment variable [ ]. In addition, Sebastian Crane very-helpfully updated the screenshot of salsa.debian.org s request access button on the How to join the Salsa group. [ ]

Contact If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. However, you can get in touch with us via:

29 June 2022

Tim Retout: Git internals and SHA-1

LWN reminds us that Git still uses SHA-1 by default. Commit or tag signing is not a mitigation, and to understand why you need to know a little about Git s internal structure. Git internally looks rather like a content-addressable filesystem, with four object types: tags, commits, trees and blobs. Content-addressable means changing the content of an object changes the way you address or reference it, and this is achieved using a cryptographic hash function. Here is an illustration of the internal structure of an example repository I created, containing two files (./foo.txt and ./bar/bar.txt) committed separately, and then tagged: Graphic showing an example Git internal structure featuring tags, commits, trees and blobs, and how these relate to each other. You can see how trees represent directories, blobs represent files, and so on. Git can avoid internal duplication of files or directories which remain identical. The hash function allows very efficient lookup of each object within git s on-disk storage. Tag and commit signatures do not directly sign the files in the repository; that is, the input to the signature function is the content of the tag/commit object, rather than the files themselves. This is analogous to the way that GPG signatures actually sign a cryptographic hash of your email, and there was a time when this too defaulted to SHA-1. An attacker who can break that hash function can bypass the guarantees of the signature function. A motivated attacker might be able to replace a blob, commit or tree in a git repository using a SHA-1 collision. Replacing a blob seems easier to me than a commit or tree, because there is no requirement that the content of the files must conform to any particular format. There is one key technical mitigation to this in Git, which is the SHA-1DC algorithm; this aims to detect and prevent known collision attacks. However, I will have to leave the cryptanalysis of this to the cryptographers! So, is this in your threat model? Do we need to lobby GitHub for SHA-256 support? Either way, I look forward to the future operational challenge of migrating the entire world s git repositories across to SHA-256.

17 June 2022

Antoine Beaupr : Matrix notes

I have some concerns about Matrix (the protocol, not the movie that came out recently, although I do have concerns about that as well). I've been watching the project for a long time, and it seems more a promising alternative to many protocols like IRC, XMPP, and Signal. This review may sound a bit negative, because it focuses on those concerns. I am the operator of an IRC network and people keep asking me to bridge it with Matrix. I have myself considered just giving up on IRC and converting to Matrix. This space is a living document exploring my research of that problem space. The TL;DR: is that no, I'm not setting up a bridge just yet, and I'm still on IRC. This article was written over the course of the last three months, but I have been watching the Matrix project for years (my logs seem to say 2016 at least). The article is rather long. It will likely take you half an hour to read, so copy this over to your ebook reader, your tablet, or dead trees, and lean back and relax as I show you around the Matrix. Or, alternatively, just jump to a section that interest you, most likely the conclusion.

Introduction to Matrix Matrix is an "open standard for interoperable, decentralised, real-time communication over IP. It can be used to power Instant Messaging, VoIP/WebRTC signalling, Internet of Things communication - or anywhere you need a standard HTTP API for publishing and subscribing to data whilst tracking the conversation history". It's also (when compared with XMPP) "an eventually consistent global JSON database with an HTTP API and pubsub semantics - whilst XMPP can be thought of as a message passing protocol." According to their FAQ, the project started in 2014, has about 20,000 servers, and millions of users. Matrix works over HTTPS but over a special port: 8448.

Security and privacy I have some concerns about the security promises of Matrix. It's advertised as a "secure" with "E2E [end-to-end] encryption", but how does it actually work?

Data retention defaults One of my main concerns with Matrix is data retention, which is a key part of security in a threat model where (for example) an hostile state actor wants to surveil your communications and can seize your devices. On IRC, servers don't actually keep messages all that long: they pass them along to other servers and clients as fast as they can, only keep them in memory, and move on to the next message. There are no concerns about data retention on messages (and their metadata) other than the network layer. (I'm ignoring the issues with user registration, which is a separate, if valid, concern.) Obviously, an hostile server could log everything passing through it, but IRC federations are normally tightly controlled. So, if you trust your IRC operators, you should be fairly safe. Obviously, clients can (and often do, even if OTR is configured!) log all messages, but this is generally not the default. Irssi, for example, does not log by default. IRC bouncers are more likely to log to disk, of course, to be able to do what they do. Compare this to Matrix: when you send a message to a Matrix homeserver, that server first stores it in its internal SQL database. Then it will transmit that message to all clients connected to that server and room, and to all other servers that have clients connected to that room. Those remote servers, in turn, will keep a copy of that message and all its metadata in their own database, by default forever. On encrypted rooms those messages are encrypted, but not their metadata. There is a mechanism to expire entries in Synapse, but it is not enabled by default. So one should generally assume that a message sent on Matrix is never expired.

GDPR in the federation But even if that setting was enabled by default, how do you control it? This is a fundamental problem of the federation: if any user is allowed to join a room (which is the default), those user's servers will log all content and metadata from that room. That includes private, one-on-one conversations, since those are essentially rooms as well. In the context of the GDPR, this is really tricky: who is the responsible party (known as the "data controller") here? It's basically any yahoo who fires up a home server and joins a room. In a federated network, one has to wonder whether GDPR enforcement is even possible at all. But in Matrix in particular, if you want to enforce your right to be forgotten in a given room, you would have to:
  1. enumerate all the users that ever joined the room while you were there
  2. discover all their home servers
  3. start a GDPR procedure against all those servers
I recognize this is a hard problem to solve while still keeping an open ecosystem. But I believe that Matrix should have much stricter defaults towards data retention than right now. Message expiry should be enforced by default, for example. (Note that there are also redaction policies that could be used to implement part of the GDPR automatically, see the privacy policy discussion below on that.) Also keep in mind that, in the brave new peer-to-peer world that Matrix is heading towards, the boundary between server and client is likely to be fuzzier, which would make applying the GDPR even more difficult. Update: this comment links to this post (in german) which apparently studied the question and concluded that Matrix is not GDPR-compliant. In fact, maybe Synapse should be designed so that there's no configurable flag to turn off data retention. A bit like how most system loggers in UNIX (e.g. syslog) come with a log retention system that typically rotate logs after a few weeks or month. Historically, this was designed to keep hard drives from filling up, but it also has the added benefit of limiting the amount of personal information kept on disk in this modern day. (Arguably, syslog doesn't rotate logs on its own, but, say, Debian GNU/Linux, as an installed system, does have log retention policies well defined for installed packages, and those can be discussed. And "no expiry" is definitely a bug.

Matrix.org privacy policy When I first looked at Matrix, five years ago, Element.io was called Riot.im and had a rather dubious privacy policy:
We currently use cookies to support our use of Google Analytics on the Website and Service. Google Analytics collects information about how you use the Website and Service. [...] This helps us to provide you with a good experience when you browse our Website and use our Service and also allows us to improve our Website and our Service.
When I asked Matrix people about why they were using Google Analytics, they explained this was for development purposes and they were aiming for velocity at the time, not privacy (paraphrasing here). They also included a "free to snitch" clause:
If we are or believe that we are under a duty to disclose or share your personal data, we will do so in order to comply with any legal obligation, the instructions or requests of a governmental authority or regulator, including those outside of the UK.
Those are really broad terms, above and beyond what is typically expected legally. Like the current retention policies, such user tracking and ... "liberal" collaboration practices with the state set a bad precedent for other home servers. Thankfully, since the above policy was published (2017), the GDPR was "implemented" (2018) and it seems like both the Element.io privacy policy and the Matrix.org privacy policy have been somewhat improved since. Notable points of the new privacy policies:
  • 2.3.1.1: the "federation" section actually outlines that "Federated homeservers and Matrix clients which respect the Matrix protocol are expected to honour these controls and redaction/erasure requests, but other federated homeservers are outside of the span of control of Element, and we cannot guarantee how this data will be processed"
  • 2.6: users under the age of 16 should not use the matrix.org service
  • 2.10: Upcloud, Mythic Beast, Amazon, and CloudFlare possibly have access to your data (it's nice to at least mention this in the privacy policy: many providers don't even bother admitting to this kind of delegation)
  • Element 2.2.1: mentions many more third parties (Twilio, Stripe, Quaderno, LinkedIn, Twitter, Google, Outplay, PipeDrive, HubSpot, Posthog, Sentry, and Matomo (phew!) used when you are paying Matrix.org for hosting
I'm not super happy with all the trackers they have on the Element platform, but then again you don't have to use that service. Your favorite homeserver (assuming you are not on Matrix.org) probably has their own Element deployment, hopefully without all that garbage. Overall, this is all a huge improvement over the previous privacy policy, so hats off to the Matrix people for figuring out a reasonable policy in such a tricky context. I particularly like this bit:
We will forget your copy of your data upon your request. We will also forward your request to be forgotten onto federated homeservers. However - these homeservers are outside our span of control, so we cannot guarantee they will forget your data.
It's great they implemented those mechanisms and, after all, if there's an hostile party in there, nothing can prevent them from using screenshots to just exfiltrate your data away from the client side anyways, even with services typically seen as more secure, like Signal. As an aside, I also appreciate that Matrix.org has a fairly decent code of conduct, based on the TODO CoC which checks all the boxes in the geekfeminism wiki.

Metadata handling Overall, privacy protections in Matrix mostly concern message contents, not metadata. In other words, who's talking with who, when and from where is not well protected. Compared to a tool like Signal, which goes through great lengths to anonymize that data with features like private contact discovery, disappearing messages, sealed senders, and private groups, Matrix is definitely behind. (Note: there is an issue open about message lifetimes in Element since 2020, but it's not at even at the MSC stage yet.) This is a known issue (opened in 2019) in Synapse, but this is not just an implementation issue, it's a flaw in the protocol itself. Home servers keep join/leave of all rooms, which gives clear text information about who is talking to. Synapse logs may also contain privately identifiable information that home server admins might not be aware of in the first place. Those log rotation policies are separate from the server-level retention policy, which may be confusing for a novice sysadmin. Combine this with the federation: even if you trust your home server to do the right thing, the second you join a public room with third-party home servers, those ideas kind of get thrown out because those servers can do whatever they want with that information. Again, a problem that is hard to solve in any federation. To be fair, IRC doesn't have a great story here either: any client knows not only who's talking to who in a room, but also typically their client IP address. Servers can (and often do) obfuscate this, but often that obfuscation is trivial to reverse. Some servers do provide "cloaks" (sometimes automatically), but that's kind of a "slap-on" solution that actually moves the problem elsewhere: now the server knows a little more about the user. Overall, I would worry much more about a Matrix home server seizure than a IRC or Signal server seizure. Signal does get subpoenas, and they can only give out a tiny bit of information about their users: their phone number, and their registration, and last connection date. Matrix carries a lot more information in its database.

Amplification attacks on URL previews I (still!) run an Icecast server and sometimes share links to it on IRC which, obviously, also ends up on (more than one!) Matrix home servers because some people connect to IRC using Matrix. This, in turn, means that Matrix will connect to that URL to generate a link preview. I feel this outlines a security issue, especially because those sockets would be kept open seemingly forever. I tried to warn the Matrix security team but somehow, I don't think this issue was taken very seriously. Here's the disclosure timeline:
  • January 18: contacted Matrix security
  • January 19: response: already reported as a bug
  • January 20: response: can't reproduce
  • January 31: timeout added, considered solved
  • January 31: I respond that I believe the security issue is underestimated, ask for clearance to disclose
  • February 1: response: asking for two weeks delay after the next release (1.53.0) including another patch, presumably in two weeks' time
  • February 22: Matrix 1.53.0 released
  • April 14: I notice the release, ask for clearance again
  • April 14: response: referred to the public disclosure
There are a couple of problems here:
  1. the bug was publicly disclosed in September 2020, and not considered a security issue until I notified them, and even then, I had to insist
  2. no clear disclosure policy timeline was proposed or seems established in the project (there is a security disclosure policy but it doesn't include any predefined timeline)
  3. I wasn't informed of the disclosure
  4. the actual solution is a size limit (10MB, already implemented), a time limit (30 seconds, implemented in PR 11784), and a content type allow list (HTML, "media" or JSON, implemented in PR 11936), and I'm not sure it's adequate
  5. (pure vanity:) I did not make it to their Hall of fame
I'm not sure those solutions are adequate because they all seem to assume a single home server will pull that one URL for a little while then stop. But in a federated network, many (possibly thousands) home servers may be connected in a single room at once. If an attacker drops a link into such a room, all those servers would connect to that link all at once. This is an amplification attack: a small amount of traffic will generate a lot more traffic to a single target. It doesn't matter there are size or time limits: the amplification is what matters here. It should also be noted that clients that generate link previews have more amplification because they are more numerous than servers. And of course, the default Matrix client (Element) does generate link previews as well. That said, this is possibly not a problem specific to Matrix: any federated service that generates link previews may suffer from this. I'm honestly not sure what the solution is here. Maybe moderation? Maybe link previews are just evil? All I know is there was this weird bug in my Icecast server and I tried to ring the bell about it, and it feels it was swept under the rug. Somehow I feel this is bound to blow up again in the future, even with the current mitigation.

Moderation In Matrix like elsewhere, Moderation is a hard problem. There is a detailed moderation guide and much of this problem space is actively worked on in Matrix right now. A fundamental problem with moderating a federated space is that a user banned from a room can rejoin the room from another server. This is why spam is such a problem in Email, and why IRC networks have stopped federating ages ago (see the IRC history for that fascinating story).

The mjolnir bot The mjolnir moderation bot is designed to help with some of those things. It can kick and ban users, redact all of a user's message (as opposed to one by one), all of this across multiple rooms. It can also subscribe to a federated block list published by matrix.org to block known abusers (users or servers). Bans are pretty flexible and can operate at the user, room, or server level. Matrix people suggest making the bot admin of your channels, because you can't take back admin from a user once given.

The command-line tool There's also a new command line tool designed to do things like:
  • System notify users (all users/users from a list, specific user)
  • delete sessions/devices not seen for X days
  • purge the remote media cache
  • select rooms with various criteria (external/local/empty/created by/encrypted/cleartext)
  • purge history of theses rooms
  • shutdown rooms
This tool and Mjolnir are based on the admin API built into Synapse.

Rate limiting Synapse has pretty good built-in rate-limiting which blocks repeated login, registration, joining, or messaging attempts. It may also end up throttling servers on the federation based on those settings.

Fundamental federation problems Because users joining a room may come from another server, room moderators are at the mercy of the registration and moderation policies of those servers. Matrix is like IRC's +R mode ("only registered users can join") by default, except that anyone can register their own homeserver, which makes this limited. Server admins can block IP addresses and home servers, but those tools are not easily available to room admins. There is an API (m.room.server_acl in /devtools) but it is not reliable (thanks Austin Huang for the clarification). Matrix has the concept of guest accounts, but it is not used very much, and virtually no client or homeserver supports it. This contrasts with the way IRC works: by default, anyone can join an IRC network even without authentication. Some channels require registration, but in general you are free to join and look around (until you get blocked, of course). I have seen anecdotal evidence (CW: Twitter, nitter link) that "moderating bridges is hell", and I can imagine why. Moderation is already hard enough on one federation, when you bridge a room with another network, you inherit all the problems from that network but without the entire abuse control tools from the original network's API...

Room admins Matrix, in particular, has the problem that room administrators (which have the power to redact messages, ban users, and promote other users) are bound to their Matrix ID which is, in turn, bound to their home servers. This implies that a home server administrators could (1) impersonate a given user and (2) use that to hijack the room. So in practice, the home server is the trust anchor for rooms, not the user themselves. That said, if server B administrator hijack user joe on server B, they will hijack that room on that specific server. This will not (necessarily) affect users on the other servers, as servers could refuse parts of the updates or ban the compromised account (or server). It does seem like a major flaw that room credentials are bound to Matrix identifiers, as opposed to the E2E encryption credentials. In an encrypted room even with fully verified members, a compromised or hostile home server can still take over the room by impersonating an admin. That admin (or even a newly minted user) can then send events or listen on the conversations. This is even more frustrating when you consider that Matrix events are actually signed and therefore have some authentication attached to them, acting like some sort of Merkle tree (as it contains a link to previous events). That signature, however, is made from the homeserver PKI keys, not the client's E2E keys, which makes E2E feel like it has been "bolted on" later.

Availability While Matrix has a strong advantage over Signal in that it's decentralized (so anyone can run their own homeserver,), I couldn't find an easy way to run a "multi-primary" setup, or even a "redundant" setup (even if with a single primary backend), short of going full-on "replicate PostgreSQL and Redis data", which is not typically for the faint of heart.

How this works in IRC On IRC, it's quite easy to setup redundant nodes. All you need is:
  1. a new machine (with it's own public address with an open port)
  2. a shared secret (or certificate) between that machine and an existing one on the network
  3. a connect block on both servers
That's it: the node will join the network and people can connect to it as usual and share the same user/namespace as the rest of the network. The servers take care of synchronizing state: you do not need to worry about replicating a database server. (Now, experienced IRC people will know there's a catch here: IRC doesn't have authentication built in, and relies on "services" which are basically bots that authenticate users (I'm simplifying, don't nitpick). If that service goes down, the network still works, but then people can't authenticate, and they can start doing nasty things like steal people's identity if they get knocked offline. But still: basic functionality still works: you can talk in rooms and with users that are on the reachable network.)

User identities Matrix is more complicated. Each "home server" has its own identity namespace: a specific user (say @anarcat:matrix.org) is bound to that specific home server. If that server goes down, that user is completely disconnected. They could register a new account elsewhere and reconnect, but then they basically lose all their configuration: contacts, joined channels are all lost. (Also notice how the Matrix IDs don't look like a typical user address like an email in XMPP. They at least did their homework and got the allocation for the scheme.)

Rooms Users talk to each other in "rooms", even in one-to-one communications. (Rooms are also used for other things like "spaces", they're basically used for everything, think "everything is a file" kind of tool.) For rooms, home servers act more like IRC nodes in that they keep a local state of the chat room and synchronize it with other servers. Users can keep talking inside a room if the server that originally hosts the room goes down. Rooms can have a local, server-specific "alias" so that, say, #room:matrix.org is also visible as #room:example.com on the example.com home server. Both addresses refer to the same room underlying room. (Finding this in the Element settings is not obvious though, because that "alias" are actually called a "local address" there. So to create such an alias (in Element), you need to go in the room settings' "General" section, "Show more" in "Local address", then add the alias name (e.g. foo), and then that room will be available on your example.com homeserver as #foo:example.com.) So a room doesn't belong to a server, it belongs to the federation, and anyone can join the room from any serer (if the room is public, or if invited otherwise). You can create a room on server A and when a user from server B joins, the room will be replicated on server B as well. If server A fails, server B will keep relaying traffic to connected users and servers. A room is therefore not fundamentally addressed with the above alias, instead ,it has a internal Matrix ID, which basically a random string. It has a server name attached to it, but that was made just to avoid collisions. That can get a little confusing. For example, the #fractal:gnome.org room is an alias on the gnome.org server, but the room ID is !hwiGbsdSTZIwSRfybq:matrix.org. That's because the room was created on matrix.org, but the preferred branding is gnome.org now. As an aside, rooms, by default, live forever, even after the last user quits. There's an admin API to delete rooms and a tombstone event to redirect to another one, but neither have a GUI yet. The latter is part of MSC1501 ("Room version upgrades") which allows a room admin to close a room, with a message and a pointer to another room.

Spaces Discovering rooms can be tricky: there is a per-server room directory, but Matrix.org people are trying to deprecate it in favor of "Spaces". Room directories were ripe for abuse: anyone can create a room, so anyone can show up in there. It's possible to restrict who can add aliases, but anyways directories were seen as too limited. In contrast, a "Space" is basically a room that's an index of other rooms (including other spaces), so existing moderation and administration mechanism that work in rooms can (somewhat) work in spaces as well. This enables a room directory that works across federation, regardless on which server they were originally created. New users can be added to a space or room automatically in Synapse. (Existing users can be told about the space with a server notice.) This gives admins a way to pre-populate a list of rooms on a server, which is useful to build clusters of related home servers, providing some sort of redundancy, at the room -- not user -- level.

Home servers So while you can workaround a home server going down at the room level, there's no such thing at the home server level, for user identities. So if you want those identities to be stable in the long term, you need to think about high availability. One limitation is that the domain name (e.g. matrix.example.com) must never change in the future, as renaming home servers is not supported. The documentation used to say you could "run a hot spare" but that has been removed. Last I heard, it was not possible to run a high-availability setup where multiple, separate locations could replace each other automatically. You can have high performance setups where the load gets distributed among workers, but those are based on a shared database (Redis and PostgreSQL) backend. So my guess is it would be possible to create a "warm" spare server of a matrix home server with regular PostgreSQL replication, but that is not documented in the Synapse manual. This sort of setup would also not be useful to deal with networking issues or denial of service attacks, as you will not be able to spread the load over multiple network locations easily. Redis and PostgreSQL heroes are welcome to provide their multi-primary solution in the comments. In the meantime, I'll just point out this is a solution that's handled somewhat more gracefully in IRC, by having the possibility of delegating the authentication layer.

Delegations If you do not want to run a Matrix server yourself, it's possible to delegate the entire thing to another server. There's a server discovery API which uses the .well-known pattern (or SRV records, but that's "not recommended" and a bit confusing) to delegate that service to another server. Be warned that the server still needs to be explicitly configured for your domain. You can't just put:
  "m.server": "matrix.org:443"  
... on https://example.com/.well-known/matrix/server and start using @you:example.com as a Matrix ID. That's because Matrix doesn't support "virtual hosting" and you'd still be connecting to rooms and people with your matrix.org identity, not example.com as you would normally expect. This is also why you cannot rename your home server. The server discovery API is what allows servers to find each other. Clients, on the other hand, use the client-server discovery API: this is what allows a given client to find your home server when you type your Matrix ID on login.

Performance The high availability discussion brushed over the performance of Matrix itself, but let's now dig into that.

Horizontal scalability There were serious scalability issues of the main Matrix server, Synapse, in the past. So the Matrix team has been working hard to improve its design. Since Synapse 1.22 the home server can horizontally scale to multiple workers (see this blog post for details) which can make it easier to scale large servers.

Other implementations There are other promising home servers implementations from a performance standpoint (dendrite, Golang, entered beta in late 2020; conduit, Rust, beta; others), but none of those are feature-complete so there's a trade-off to be made there. Synapse is also adding a lot of feature fast, so it's an open question whether the others will ever catch up. (I have heard that Dendrite might actually surpass Synapse in features within a few years, which would put Synapse in a more "LTS" situation.)

Latency Matrix can feel slow sometimes. For example, joining the "Matrix HQ" room in Element (from matrix.debian.social) takes a few minutes and then fails. That is because the home server has to sync the entire room state when you join the room. There was promising work on this announced in the lengthy 2021 retrospective, and some of that work landed (partial sync) in the 1.53 release already. Other improvements coming include sliding sync, lazy loading over federation, and fast room joins. So that's actually something that could be fixed in the fairly short term. But in general, communication in Matrix doesn't feel as "snappy" as on IRC or even Signal. It's hard to quantify this without instrumenting a full latency test bed (for example the tools I used in the terminal emulators latency tests), but even just typing in a web browser feels slower than typing in a xterm or Emacs for me. Even in conversations, I "feel" people don't immediately respond as fast. In fact, this could be an interesting double-blind experiment to make: have people guess whether they are talking to a person on Matrix, XMPP, or IRC, for example. My theory would be that people could notice that Matrix users are slower, if only because of the TCP round-trip time each message has to take.

Transport Some courageous person actually made some tests of various messaging platforms on a congested network. His evaluation was basically:
  • Briar: uses Tor, so unusable except locally
  • Matrix: "struggled to send and receive messages", joining a room takes forever as it has to sync all history, "took 20-30 seconds for my messages to be sent and another 20 seconds for further responses"
  • XMPP: "worked in real-time, full encryption, with nearly zero lag"
So that was interesting. I suspect IRC would have also fared better, but that's just a feeling. Other improvements to the transport layer include support for websocket and the CoAP proxy work from 2019 (targeting 100bps links), but both seem stalled at the time of writing. The Matrix people have also announced the pinecone p2p overlay network which aims at solving large, internet-scale routing problems. See also this talk at FOSDEM 2022.

Usability

Onboarding and workflow The workflow for joining a room, when you use Element web, is not great:
  1. click on a link in a web browser
  2. land on (say) https://matrix.to/#/#matrix-dev:matrix.org
  3. offers "Element", yeah that's sounds great, let's click "Continue"
  4. land on https://app.element.io/#/room%2F%23matrix-dev%3Amatrix.org and then you need to register, aaargh
As you might have guessed by now, there is a specification to solve this, but web browsers need to adopt it as well, so that's far from actually being solved. At least browsers generally know about the matrix: scheme, it's just not exactly clear what they should do with it, especially when the handler is just another web page (e.g. Element web). In general, when compared with tools like Signal or WhatsApp, Matrix doesn't fare so well in terms of user discovery. I probably have some of my normal contacts that have a Matrix account as well, but there's really no way to know. It's kind of creepy when Signal tells you "this person is on Signal!" but it's also pretty cool that it works, and they actually implemented it pretty well. Registration is also less obvious: in Signal, the app confirms your phone number automatically. It's friction-less and quick. In Matrix, you need to learn about home servers, pick one, register (with a password! aargh!), and then setup encryption keys (not default), etc. It's a lot more friction. And look, I understand: giving away your phone number is a huge trade-off. I don't like it either. But it solves a real problem and makes encryption accessible to a ton more people. Matrix does have "identity servers" that can serve that purpose, but I don't feel confident sharing my phone number there. It doesn't help that the identity servers don't have private contact discovery: giving them your phone number is a more serious security compromise than with Signal. There's a catch-22 here too: because no one feels like giving away their phone numbers, no one does, and everyone assumes that stuff doesn't work anyways. Like it or not, Signal forcing people to divulge their phone number actually gives them critical mass that means actually a lot of my relatives are on Signal and I don't have to install crap like WhatsApp to talk with them.

5 minute clients evaluation Throughout all my tests I evaluated a handful of Matrix clients, mostly from Flathub because almost none of them are packaged in Debian. Right now I'm using Element, the flagship client from Matrix.org, in a web browser window, with the PopUp Window extension. This makes it look almost like a native app, and opens links in my main browser window (instead of a new tab in that separate window), which is nice. But I'm tired of buying memory to feed my web browser, so this indirection has to stop. Furthermore, I'm often getting completely logged off from Element, which means re-logging in, recovering my security keys, and reconfiguring my settings. That is extremely annoying. Coming from Irssi, Element is really "GUI-y" (pronounced "gooey"). Lots of clickety happening. To mark conversations as read, in particular, I need to click-click-click on all the tabs that have some activity. There's no "jump to latest message" or "mark all as read" functionality as far as I could tell. In Irssi the former is built-in (alt-a) and I made a custom /READ command for the latter:
/ALIAS READ script exec \$_->activity(0) for Irssi::windows
And yes, that's a Perl script in my IRC client. I am not aware of any Matrix client that does stuff like that, except maybe Weechat, if we can call it a Matrix client, or Irssi itself, now that it has a Matrix plugin (!). As for other clients, I have looked through the Matrix Client Matrix (confusing right?) to try to figure out which one to try, and, even after selecting Linux as a filter, the chart is just too wide to figure out anything. So I tried those, kind of randomly:
  • Fractal
  • Mirage
  • Nheko
  • Quaternion
Unfortunately, I lost my notes on those, I don't actually remember which one did what. I still have a session open with Mirage, so I guess that means it's the one I preferred, but I remember they were also all very GUI-y. Maybe I need to look at weechat-matrix or gomuks. At least Weechat is scriptable so I could continue playing the power-user. Right now my strategy with messaging (and that includes microblogging like Twitter or Mastodon) is that everything goes through my IRC client, so Weechat could actually fit well in there. Going with gomuks, on the other hand, would mean running it in parallel with Irssi or ... ditching IRC, which is a leap I'm not quite ready to take just yet. Oh, and basically none of those clients (except Nheko and Element) support VoIP, which is still kind of a second-class citizen in Matrix. It does not support large multimedia rooms, for example: Jitsi was used for FOSDEM instead of the native videoconferencing system.

Bots This falls a little aside the "usability" section, but I didn't know where to put this... There's a few Matrix bots out there, and you are likely going to be able to replace your existing bots with Matrix bots. It's true that IRC has a long and impressive history with lots of various bots doing various things, but given how young Matrix is, there's still a good variety:
  • maubot: generic bot with tons of usual plugins like sed, dice, karma, xkcd, echo, rss, reminder, translate, react, exec, gitlab/github webhook receivers, weather, etc
  • opsdroid: framework to implement "chat ops" in Matrix, connects with Matrix, GitHub, GitLab, Shell commands, Slack, etc
  • matrix-nio: another framework, used to build lots more bots like:
    • hemppa: generic bot with various functionality like weather, RSS feeds, calendars, cron jobs, OpenStreetmaps lookups, URL title snarfing, wolfram alpha, astronomy pic of the day, Mastodon bridge, room bridging, oh dear
    • devops: ping, curl, etc
    • podbot: play podcast episodes from AntennaPod
    • cody: Python, Ruby, Javascript REPL
    • eno: generic bot, "personal assistant"
  • mjolnir: moderation bot
  • hookshot: bridge with GitLab/GitHub
  • matrix-monitor-bot: latency monitor
One thing I haven't found an equivalent for is Debian's MeetBot. There's an archive bot but it doesn't have topics or a meeting chair, or HTML logs.

Working on Matrix As a developer, I find Matrix kind of intimidating. The specification is huge. The official specification itself looks somewhat digestable: it's only 6 APIs so that looks, at first, kind of reasonable. But whenever you start asking complicated questions about Matrix, you quickly fall into the Matrix Spec Change specification (which, yes, is a separate specification). And there are literally hundreds of MSCs flying around. It's hard to tell what's been adopted and what hasn't, and even harder to figure out if your specific client has implemented it. (One trendy answer to this problem is to "rewrite it in rust": Matrix are working on implementing a lot of those specifications in a matrix-rust-sdk that's designed to take the implementation details away from users.) Just taking the latest weekly Matrix report, you find that three new MSCs proposed, just last week! There's even a graph that shows the number of MSCs is progressing steadily, at 600+ proposals total, with the majority (300+) "new". I would guess the "merged" ones are at about 150. That's a lot of text which includes stuff like 3D worlds which, frankly, I don't think you should be working on when you have such important security and usability problems. (The internet as a whole, arguably, doesn't fare much better. RFC600 is a really obscure discussion about "INTERFACING AN ILLINOIS PLASMA TERMINAL TO THE ARPANET". Maybe that's how many MSCs will end up as well, left forgotten in the pits of history.) And that's the thing: maybe the Matrix people have a different objective than I have. They want to connect everything to everything, and make Matrix a generic transport for all sorts of applications, including virtual reality, collaborative editors, and so on. I just want secure, simple messaging. Possibly with good file transfers, and video calls. That it works with existing stuff is good, and it should be federated to remove the "Signal point of failure". So I'm a bit worried with the direction all those MSCs are taking, especially when you consider that clients other than Element are still struggling to keep up with basic features like end-to-end encryption or room discovery, never mind voice or spaces...

Conclusion Overall, Matrix is somehow in the space XMPP was a few years ago. It has a ton of features, pretty good clients, and a large community. It seems to have gained some of the momentum that XMPP has lost. It may have the most potential to replace Signal if something bad would happen to it (like, I don't know, getting banned or going nuts with cryptocurrency)... But it's really not there yet, and I don't see Matrix trying to get there either, which is a bit worrisome.

Looking back at history I'm also worried that we are repeating the errors of the past. The history of federated services is really fascinating:. IRC, FTP, HTTP, and SMTP were all created in the early days of the internet, and are all still around (except, arguably, FTP, which was removed from major browsers recently). All of them had to face serious challenges in growing their federation. IRC had numerous conflicts and forks, both at the technical level but also at the political level. The history of IRC is really something that anyone working on a federated system should study in detail, because they are bound to make the same mistakes if they are not familiar with it. The "short" version is:
  • 1988: Finnish researcher publishes first IRC source code
  • 1989: 40 servers worldwide, mostly universities
  • 1990: EFnet ("eris-free network") fork which blocks the "open relay", named Eris - followers of Eris form the A-net, which promptly dissolves itself, with only EFnet remaining
  • 1992: Undernet fork, which offered authentication ("services"), routing improvements and timestamp-based channel synchronisation
  • 1994: DALnet fork, from Undernet, again on a technical disagreement
  • 1995: Freenode founded
  • 1996: IRCnet forks from EFnet, following a flame war of historical proportion, splitting the network between Europe and the Americas
  • 1997: Quakenet founded
  • 1999: (XMPP founded)
  • 2001: 6 million users, OFTC founded
  • 2002: DALnet peaks at 136,000 users
  • 2003: IRC as a whole peaks at 10 million users, EFnet peaks at 141,000 users
  • 2004: (Facebook founded), Undernet peaks at 159,000 users
  • 2005: Quakenet peaks at 242,000 users, IRCnet peaks at 136,000 (Youtube founded)
  • 2006: (Twitter founded)
  • 2009: (WhatsApp, Pinterest founded)
  • 2010: (TextSecure AKA Signal, Instagram founded)
  • 2011: (Snapchat founded)
  • ~2013: Freenode peaks at ~100,000 users
  • 2016: IRCv3 standardisation effort started (TikTok founded)
  • 2021: Freenode self-destructs, Libera chat founded
  • 2022: Libera peaks at 50,000 users, OFTC peaks at 30,000 users
(The numbers were taken from the Wikipedia page and Netsplit.de. Note that I also include other networks launch in parenthesis for context.) Pretty dramatic, don't you think? Eventually, somehow, IRC became irrelevant for most people: few people are even aware of it now. With less than a million users active, it's smaller than Mastodon, XMPP, or Matrix at this point.1 If I were to venture a guess, I'd say that infighting, lack of a standardization body, and a somewhat annoying protocol meant the network could not grow. It's also possible that the decentralised yet centralised structure of IRC networks limited their reliability and growth. But large social media companies have also taken over the space: observe how IRC numbers peak around the time the wave of large social media companies emerge, especially Facebook (2.9B users!!) and Twitter (400M users).

Where the federated services are in history Right now, Matrix, and Mastodon (and email!) are at the "pre-EFnet" stage: anyone can join the federation. Mastodon has started working on a global block list of fascist servers which is interesting, but it's still an open federation. Right now, Matrix is totally open, but matrix.org publishes a (federated) block list of hostile servers (#matrix-org-coc-bl:matrix.org, yes, of course it's a room). Interestingly, Email is also in that stage, where there are block lists of spammers, and it's a race between those blockers and spammers. Large email providers, obviously, are getting closer to the EFnet stage: you could consider they only accept email from themselves or between themselves. It's getting increasingly hard to deliver mail to Outlook and Gmail for example, partly because of bias against small providers, but also because they are including more and more machine-learning tools to sort through email and those systems are, fundamentally, unknowable. It's not quite the same as splitting the federation the way EFnet did, but the effect is similar. HTTP has somehow managed to live in a parallel universe, as it's technically still completely federated: anyone can start a web server if they have a public IP address and anyone can connect to it. The catch, of course, is how you find the darn thing. Which is how Google became one of the most powerful corporations on earth, and how they became the gatekeepers of human knowledge online. I have only briefly mentioned XMPP here, and my XMPP fans will undoubtedly comment on that, but I think it's somewhere in the middle of all of this. It was co-opted by Facebook and Google, and both corporations have abandoned it to its fate. I remember fondly the days where I could do instant messaging with my contacts who had a Gmail account. Those days are gone, and I don't talk to anyone over Jabber anymore, unfortunately. And this is a threat that Matrix still has to face. It's also the threat Email is currently facing. On the one hand corporations like Facebook want to completely destroy it and have mostly succeeded: many people just have an email account to register on things and talk to their friends over Instagram or (lately) TikTok (which, I know, is not Facebook, but they started that fire). On the other hand, you have corporations like Microsoft and Google who are still using and providing email services because, frankly, you still do need email for stuff, just like fax is still around but they are more and more isolated in their own silo. At this point, it's only a matter of time they reach critical mass and just decide that the risk of allowing external mail coming in is not worth the cost. They'll simply flip the switch and work on an allow-list principle. Then we'll have closed the loop and email will be dead, just like IRC is "dead" now. I wonder which path Matrix will take. Could it liberate us from these vicious cycles? Update: this generated some discussions on lobste.rs.

  1. According to Wikipedia, there are currently about 500 distinct IRC networks operating, on about 1,000 servers, serving over 250,000 users. In contrast, Mastodon seems to be around 5 million users, Matrix.org claimed at FOSDEM 2021 to have about 28 million globally visible accounts, and Signal lays claim to over 40 million souls. XMPP claims to have "millions" of users on the xmpp.org homepage but the FAQ says they don't actually know. On the proprietary silo side of the fence, this page says
    • Facebook: 2.9 billion users
    • WhatsApp: 2B
    • Instagram: 1.4B
    • TikTok: 1B
    • Snapchat: 500M
    • Pinterest: 480M
    • Twitter: 397M
    Notable omission from that list: Youtube, with its mind-boggling 2.6 billion users... Those are not the kind of numbers you just "need to convince a brother or sister" to grow the network...

16 May 2022

Matthew Garrett: Can we fix bearer tokens?

Last month I wrote about how bearer tokens are just awful, and a week later Github announced that someone had managed to exfiltrate bearer tokens from Heroku that gave them access to, well, a lot of Github repositories. This has inevitably resulted in a whole bunch of discussion about a number of things, but people seem to be largely ignoring the fundamental issue that maybe we just shouldn't have magical blobs that grant you access to basically everything even if you've copied them from a legitimate holder to Honest John's Totally Legitimate API Consumer.

To make it clearer what the problem is here, let's use an analogy. You have a safety deposit box. To gain access to it, you simply need to be able to open it with a key you were given. Anyone who turns up with the key can open the box and do whatever they want with the contents. Unfortunately, the key is extremely easy to copy - anyone who is able to get hold of your keyring for a moment is in a position to duplicate it, and then they have access to the box. Wouldn't it be better if something could be done to ensure that whoever showed up with a working key was someone who was actually authorised to have that key?

To achieve that we need some way to verify the identity of the person holding the key. In the physical world we have a range of ways to achieve this, from simply checking whether someone has a piece of ID that associates them with the safety deposit box all the way up to invasive biometric measurements that supposedly verify that they're definitely the same person. But computers don't have passports or fingerprints, so we need another way to identify them.

When you open a browser and try to connect to your bank, the bank's website provides a TLS certificate that lets your browser know that you're talking to your bank instead of someone pretending to be your bank. The spec allows this to be a bi-directional transaction - you can also prove your identity to the remote website. This is referred to as "mutual TLS", or mTLS, and a successful mTLS transaction ends up with both ends knowing who they're talking to, as long as they have a reason to trust the certificate they were presented with.

That's actually a pretty big constraint! We have a reasonable model for the server - it's something that's issued by a trusted third party and it's tied to the DNS name for the server in question. Clients don't tend to have stable DNS identity, and that makes the entire thing sort of awkward. But, thankfully, maybe we don't need to? We don't need the client to be able to prove its identity to arbitrary third party sites here - we just need the client to be able to prove it's a legitimate holder of whichever bearer token it's presenting to that site. And that's a much easier problem.

Here's the simple solution - clients generate a TLS cert. This can be self-signed, because all we want to do here is be able to verify whether the machine talking to us is the same one that had a token issued to it. The client contacts a service that's going to give it a bearer token. The service requests mTLS auth without being picky about the certificate that's presented. The service embeds a hash of that certificate in the token before handing it back to the client. Whenever the client presents that token to any other service, the service ensures that the mTLS cert the client presented matches the hash in the bearer token. Copy the token without copying the mTLS certificate and the token gets rejected. Hurrah hurrah hats for everyone.

Well except for the obvious problem that if you're in a position to exfiltrate the bearer tokens you can probably just steal the client certificates and keys as well, and now you can pretend to be the original client and this is not adding much additional security. Fortunately pretty much everything we care about has the ability to store the private half of an asymmetric key in hardware (TPMs on Linux and Windows systems, the Secure Enclave on Macs and iPhones, either a piece of magical hardware or Trustzone on Android) in a way that avoids anyone being able to just steal the key.

How do we know that the key is actually in hardware? Here's the fun bit - it doesn't matter. If you're issuing a bearer token to a system then you're already asserting that the system is trusted. If the system is lying to you about whether or not the key it's presenting is hardware-backed then you've already lost. If it lied and the system is later compromised then sure all your apes get stolen, but maybe don't run systems that lie and avoid that situation as a result?

Anyway. This is covered in RFC 8705 so why aren't we all doing this already? From the client side, the largest generic issue is that TPMs are astonishingly slow in comparison to doing a TLS handshake on the CPU. RSA signing operations on TPMs can take around half a second, which doesn't sound too bad, except your browser is probably establishing multiple TLS connections to subdomains on the site it's connecting to and performance is going to tank. Fixing this involves doing whatever's necessary to convince the browser to pipe everything over a single TLS connection, and that's just not really where the web is right at the moment. Using EC keys instead helps a lot (~0.1 seconds per signature on modern TPMs), but it's still going to be a bottleneck.

The other problem, of course, is that ecosystem support for hardware-backed certificates is just awful. Windows lets you stick them into the standard platform certificate store, but the docs for this are hidden in a random PDF in a Github repo. Macs require you to do some weird bridging between the Secure Enclave API and the keychain API. Linux? Well, the standard answer is to do PKCS#11, and I have literally never met anybody who likes PKCS#11 and I have spent a bunch of time in standards meetings with the sort of people you might expect to like PKCS#11 and even they don't like it. It turns out that loading a bunch of random C bullshit that has strong feelings about function pointers into your security critical process is not necessarily something that is going to improve your quality of life, so instead you should use something like this and just have enough C to bridge to a language that isn't secretly plotting to kill your pets the moment you turn your back.

And, uh, obviously none of this matters at all unless people actually support it. Github has no support at all for validating the identity of whoever holds a bearer token. Most issuers of bearer tokens have no support for embedding holder identity into the token. This is not good! As of last week, all three of the big cloud providers support virtualised TPMs in their VMs - we should be running CI on systems that can do that, and tying any issued tokens to the VMs that are supposed to be making use of them.

So sure this isn't trivial. But it's also not impossible, and making this stuff work would improve the security of, well, everything. We literally have the technology to prevent attacks like Github suffered. What do we have to do to get people to actually start working on implementing that?

comment count unavailable comments

27 April 2022

Antoine Beaupr : Using LSP in Emacs and Debian

The Language Server Protocol (LSP) is a neat mechanism that provides a common interface to what used to be language-specific lookup mechanisms (like, say, running a Python interpreter in the background to find function definitions). There is also ctags shipped with UNIX since forever, but that doesn't support looking backwards ("who uses this function"), linting, or refactoring. In short, LSP rocks, and how do I use it right now in my editor of choice (Emacs, in my case) and OS (Debian) please?

Editor (emacs) setup First, you need to setup your editor. The Emacs LSP mode has pretty good installation instructions which, for me, currently mean:
apt install elpa-lsp-mode
and this .emacs snippet:
(use-package lsp-mode
  :commands (lsp lsp-deferred)
  :hook ((python-mode go-mode) . lsp-deferred)
  :demand t
  :init
  (setq lsp-keymap-prefix "C-c l")
  ;; TODO: https://emacs-lsp.github.io/lsp-mode/page/performance/
  ;; also note re "native compilation": <+varemara> it's the
  ;; difference between lsp-mode being usable or not, for me
  :config
  (setq lsp-auto-configure t))
(use-package lsp-ui
  :config
  (setq lsp-ui-flycheck-enable t)
  (add-to-list 'lsp-ui-doc-frame-parameters '(no-accept-focus . t))
  (define-key lsp-ui-mode-map [remap xref-find-definitions] #'lsp-ui-peek-find-definitions)
  (define-key lsp-ui-mode-map [remap xref-find-references] #'lsp-ui-peek-find-references))
Note: this configuration might have changed since I wrote this, see my init.el configuration for the most recent config. The main reason for choosing lsp-mode over eglot is that it's in Debian (and eglot is not). (Apparently, eglot has more chance of being upstreamed, "when it's done", but I guess I'll cross that bridge when I get there.) I already had lsp-mode partially setup in Emacs so I only had to do this small tweak to switch and change the prefix key (because s-l or mod is used by my window manager). I also had to pin LSP packages to bookworm here so that it properly detects pylsp (the older version in Debian bullseye only supports pyls, not packaged in Debian). This won't do anything by itself: Emacs will need something to talk with to provide the magic. Those are called "servers" and are basically different programs, for each programming language, that provide the magic.

Servers setup The Emacs package provides a way (M-x lsp-install-server) to install some of them, but I prefer to manage those tools through Debian packages if possible, just like lsp-mode itself. Those are the servers I currently know of in Debian:
package languages
ccls C, C++, ObjectiveC
clangd C, C++, ObjectiveC
elpa-lsp-haskell Haskell
fortran-language-server Fortran
gopls Golang
python3-pyls Python
There might be more such packages, but those are surprisingly hard to find. I found a few with apt search "Language Server Protocol", but that didn't find ccls, for example, because that just said "Language Server" in the description (which also found a few more pyls plugins, e.g. black support). Note that the Python packages, in particular, need to be upgraded to their bookworm releases to work properly (here). It seems like there's some interoperability problems there that I haven't quite figured out yet. See also my Puppet configuration for LSP. Finally, note that I have now completely switched away from Elpy to pyls, and I'm quite happy with the results. lsp-mode feels slower than elpy but I haven't done any of the performance tuning and this will improve even more with native compilation. And lsp-mode is much more powerful. I particularly like the "rename symbol" functionality, which ... mostly works.

Remaining work

Puppet and Ruby I still have to figure how to actually use this: I mostly spend my time in Puppet these days, there is no server listed in the Emacs lsp-mode language list, but there is one listed over at the upstream language list, the puppet-editor-services server. But it's not packaged in Debian, and seems somewhat... involved. It could still be a good productivity boost. The Voxpupuli team have vim install instructions which also suggest installing solargraph, the Ruby language server, also not packaged in Debian.

Bash I guess I do a bit of shell scripting from time to time nowadays, even though I don't like it. So the bash-language-server may prove useful as well.

Other languages Here are more language servers available:

5 April 2022

Kees Cook: security things in Linux v5.10

Previously: v5.9 Linux v5.10 was released in December, 2020. Here s my summary of various security things that I found interesting: AMD SEV-ES
While guest VM memory encryption with AMD SEV has been supported for a while, Joerg Roedel, Thomas Lendacky, and others added register state encryption (SEV-ES). This means it s even harder for a VM host to reconstruct a guest VM s state. x86 static calls
Josh Poimboeuf and Peter Zijlstra implemented static calls for x86, which operates very similarly to the static branch infrastructure in the kernel. With static branches, an if/else choice can be hard-coded, instead of being run-time evaluated every time. Such branches can be updated too (the kernel just rewrites the code to switch around the branch ). All these principles apply to static calls as well, but they re for replacing indirect function calls (i.e. a call through a function pointer) with a direct call (i.e. a hard-coded call address). This eliminates the need for Spectre mitigations (e.g. RETPOLINE) for these indirect calls, and avoids a memory lookup for the pointer. For hot-path code (like the scheduler), this has a measurable performance impact. It also serves as a kind of Control Flow Integrity implementation: an indirect call got removed, and the potential destinations have been explicitly identified at compile-time. network RNG improvements
In an effort to improve the pseudo-random number generator used by the network subsystem (for things like port numbers and packet sequence numbers), Linux s home-grown pRNG has been replaced by the SipHash round function, and perturbed by (hopefully) hard-to-predict internal kernel states. This should make it very hard to brute force the internal state of the pRNG and make predictions about future random numbers just from examining network traffic. Similarly, ICMP s global rate limiter was adjusted to avoid leaking details of network state, as a start to fixing recent DNS Cache Poisoning attacks. SafeSetID handles GID
Thomas Cedeno improved the SafeSetID LSM to handle group IDs (which required teaching the kernel about which syscalls were actually performing setgid.) Like the earlier setuid policy, this lets the system owner define an explicit list of allowed group ID transitions under CAP_SETGID (instead of to just any group), providing a way to keep the power of granting this capability much more limited. (This isn t complete yet, though, since handling setgroups() is still needed.) improve kernel s internal checking of file contents
The kernel provides LSMs (like the Integrity subsystem) with details about files as they re loaded. (For example, loading modules, new kernel images for kexec, and firmware.) There wasn t very good coverage for cases where the contents were coming from things that weren t files. To deal with this, new hooks were added that allow the LSMs to introspect the contents directly, and to do partial reads. This will give the LSMs much finer grain visibility into these kinds of operations. set_fs removal continues
With the earlier work landed to free the core kernel code from set_fs(), Christoph Hellwig made it possible for set_fs() to be optional for an architecture. Subsequently, he then removed set_fs() entirely for x86, riscv, and powerpc. These architectures will now be free from the entire class of kernel address limit attacks that only needed to corrupt a single value in struct thead_info. sysfs_emit() replaces sprintf() in /sys
Joe Perches tackled one of the most common bug classes with sprintf() and snprintf() in /sys handlers by creating a new helper, sysfs_emit(). This will handle the cases where kernel code was not correctly dealing with the length results from sprintf() calls, which might lead to buffer overflows in the PAGE_SIZE buffer that /sys handlers operate on. With the helper in place, it was possible to start the refactoring of the many sprintf() callers. nosymfollow mount option
Mattias Nissler and Ross Zwisler implemented the nosymfollow mount option. This entirely disables symlink resolution for the given filesystem, similar to other mount options where noexec disallows execve(), nosuid disallows setid bits, and nodev disallows device files. Quoting the patch, it is useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts. (i.e. for when /proc/sys/fs/protected_symlinks isn t a big enough hammer.) Chrome OS uses this option for its stateful filesystem, as symlink traversal as been a common attack-persistence vector. ARMv8.5 Memory Tagging Extension support
Vincenzo Frascino added support to arm64 for the coming Memory Tagging Extension, which will be available for ARMv8.5 and later chips. It provides 4 bits of tags (covering multiples of 16 byte spans of the address space). This is enough to deterministically eliminate all linear heap buffer overflow flaws (1 tag for free , and then rotate even values and odd values for neighboring allocations), which is probably one of the most common bugs being currently exploited. It also makes use-after-free and over/under indexing much more difficult for attackers (but still possible if the target s tag bits can be exposed). Maybe some day we can switch to 128 bit virtual memory addresses and have fully versioned allocations. But for now, 16 tag values is better than none, though we do still need to wait for anyone to actually be shipping ARMv8.5 hardware. fixes for flaws found by UBSAN
The work to make UBSAN generally usable under syzkaller continues to bear fruit, with various fixes all over the kernel for stuff like shift-out-of-bounds, divide-by-zero, and integer overflow. Seeing these kinds of patches land reinforces the the rationale of shifting the burden of these kinds of checks to the toolchain: these run-time bugs continue to pop up. flexible array conversions
The work on flexible array conversions continues. Gustavo A. R. Silva and others continued to grind on the conversions, getting the kernel ever closer to being able to enable the -Warray-bounds compiler flag and clear the path for saner bounds checking of array indexes and memcpy() usage. That s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.11.

2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

28 March 2022

Antoine Beaupr : What is going on with web servers

I stumbled upon this graph recently, which is w3techs.com graph of "Historical yearly trends in the usage statistics of web servers". It seems I hadn't looked at it in a long while because I was surprised at many levels:
  1. Apache is now second, behind Nginx, since ~2022 (so that's really new at least)
  2. Cloudflare "server" is third ahead of the traditional third (Microsoft IIS) - I somewhat knew that Cloudflare was hosting a lot of stuff, but I somehow didn't expect to see it there at all for some reason
  3. I had to lookup what LiteSpeed was (and it's not a bike company): it's a drop-in replacement (!?) of the Apache web server (not a fork, a bizarre idea but which seems to be gaining a lot of speed recently, possibly because of its support for QUIC and HTTP/3
So there. I'm surprised. I guess the stats should be taken with a grain of salt because they only partially correlate with Netcraft's numbers which barely mention LiteSpeed at all. (But they do point at a rising share as well.) Netcraft also puts Nginx's first place earlier in time, around April 2019, which is about when Microsoft's IIS took a massive plunge down. That is another thing that doesn't map with w3techs' graphs at all either. So it's all lies and statistics, basically. Moving on. Oh and of course, the two first most popular web servers, regardless of the source, are package in Debian. So while we're working on statistics and just making stuff up, I'm going to go ahead and claim all of this stuff runs on Linux and that BSD is dead. Or something like that.

31 January 2022

Ingo Juergensmann: XMPP and Mail Clients

I really like XMPP, but I m a little unhappy about the current general situation of XMPP. I think XMPP could do better if there were some benefits of having an XMPP address. For me one of those benefits is to have the option to have just one address I need to communicate to others. If everything is in place and well-configured, a user can be reached by mail, XMPP and SIP (voice/video calls) by just one address. To address this I would like to see XMPP support in mail clients (MUAs). So when you reply to a mail or write a new one, the client will do a lookup in your addressbook if the address has an XMPP field associated with it and (if not) do a DNS lookup for _xmpp-server._tcp.example.com (or the matching domain part of recipients address). If there is an XMPP address listed in mail header, that JID will be used. When the lookup is successful and an xmpp: protocol handler is configured in the system, the MUA offers an option to begin a chat with the recipient and/or displays the presence status of the recipients (depending on available web-presence or presence subscription). Basically a good candidate could be Thunderbird, because it already has XMPP support built in, albeit not a good implementation and lacking many modern features like OMEMO. But for basic functions (like presence status and such) it should be sufficient for a start. Other candidates could be Evolution, Kmail (as KDE MUA and Kaidan as a native KDE XMPP client) or even Apple Mail.app, because Apples addressbook supports XMPP fields for each contact. Basically the same could be done for SIP contacts: if a SIP SRV record for that domain does exist, the MUA could offer an option to call the recipient. I would be willing to give some money via Bountysource or similar platforms. Is anyone aware of such a project or willing to write some addons? Maybe within the GSoC? PS: there is RFC7259 about Jabber/XMPP JID in mail headers and there is also a page in the XMPP.org wiki.

9 December 2021

David Kalnischkies: APT for Advent of Code

Screenshot of my Advent of Code 2021 status page as of today Advent of Code 2021
Advent of Code, for those not in the know, is a yearly Advent calendar (since 2015) of coding puzzles many people participate in for a plenary of reasons ranging from speed coding to code golf with stops at learning a new language or practicing already known ones. I usually write boring C++, but any language and then some can be used. There are reports of people implementing it in hardware, solving them by hand on paper or using Microsoft Excel so, after solving a puzzle the easy way yesterday, this time I thought: CHALLENGE ACCEPTED! as I somehow remembered an old 2008 article about solving Sudoku with aptitude (Daniel Burrows via archive.org as the blog is long gone) and the good same old a package management system that can solve [puzzles] based on package dependency rules is not something that I think would be useful or worth having (Russell Coker). Day 8 has a rather lengthy problem description and can reasonably be approached in a bunch of different way. One unreasonable approach might be to massage the problem description into Debian packages and let apt help me solve the problem (specifically Part 2, which you unlock by solving Part 1. You can do that now, I will wait here.) Be warned: I am spoiling Part 2 in the following, so solve it yourself first if you are interested. I will try to be reasonable consistent in naming things in the following and so have chosen: The input we get are lines like acedgfb cdfbe gcdfa fbcad dab cefabd cdfgeb eafb cagedb ab cdfeb fcadb cdfeb cdbaf. The letters are wires mixed up and connected to the segments of the displays: A group of these letters is hence a digit (the first 10) which represent one of the digits 0 to 9 and (after the pipe) the four displays which match (after sorting) one of the digits which means this display shows this digit. We are interested in which digits are displayed to solve the puzzle. To help us we also know which segments form which digit, we just don't know the wiring in the back. So we should identify which wire maps to which segment! We are introducing the packages wire-X-connects-to-Y for this which each provide & conflict1 with the virtual packages segment-Y and wire-X-connects. The later ensures that for a given wire we can only pick one segment and the former ensures that not multiple wires map onto the same segment. As an example: wire a's possible association with segment b is described as:
Package: wire-a-connects-to-b
Provides: segment-b, wire-a-connects
Conflicts: segment-b, wire-a-connects
Note that we do not know if this is true! We generate packages for all possible (and then some) combinations and hope dependency resolution will solve the problem for us. So don't worry, the hard part will be done by apt, we just have to provide all (im)possibilities! What we need now is to translate the 10 digits (and 4 outputs) from something like acedgfb into digit-0-is-eight and not, say digit-0-is-one. A clever solution might realize that a one consists only of two segments so a digit wiring up seven segments can not be a 1 (and must be 8 instead), but again we aren't here to be clever: We want apt to figure that out for us! So what we do is simply making every digit-0-is-N (im)possible choice available as a package and apply constraints: A given digit-N can only display one number and each N is unique as digit so for both we deploy Provides & Conflicts again. We also need to reason about the segments in the digits: Each of the digit packages gets Depends on wire-X-connects-to-Y where X is each possible wire (e.g. acedgfb) and Y each segment forming the digit (e.g. cf for one). The different choices for X are or'ed together, so that either of them satisfies the Y. We know something else too through: The segments which are not used by the digit can not be wired to any of the Xs. We model this with Conflicts on wire-X-connects-to-Y. As an example: If digit-0s acedgfb would be displaying a one (remember, it can't) the following package would be installable:
Package: digit-0-is-one
Version: 1
Depends: wire-a-connects-to-c   wire-c-connects-to-c   wire-e-connects-to-c   wire-d-connects-to-c   wire-g-connects-to-c   wire-f-connects-to-c   wire-b-connects-to-c,
         wire-a-connects-to-f   wire-c-connects-to-f   wire-e-connects-to-f   wire-d-connects-to-f   wire-g-connects-to-f   wire-f-connects-to-f   wire-b-connects-to-f
Provides: digit-0, digit-is-one
Conflicts: digit-0, digit-is-one,
  wire-a-connects-to-a, wire-c-connects-to-a, wire-e-connects-to-a, wire-d-connects-to-a, wire-g-connects-to-a, wire-f-connects-to-a, wire-b-connects-to-a,
  wire-a-connects-to-b, wire-c-connects-to-b, wire-e-connects-to-b, wire-d-connects-to-b, wire-g-connects-to-b, wire-f-connects-to-b, wire-b-connects-to-b,
  wire-a-connects-to-d, wire-c-connects-to-d, wire-e-connects-to-d, wire-d-connects-to-d, wire-g-connects-to-d, wire-f-connects-to-d, wire-b-connects-to-d,
  wire-a-connects-to-e, wire-c-connects-to-e, wire-e-connects-to-e, wire-d-connects-to-e, wire-g-connects-to-e, wire-f-connects-to-e, wire-b-connects-to-e,
  wire-a-connects-to-g, wire-c-connects-to-g, wire-e-connects-to-g, wire-d-connects-to-g, wire-g-connects-to-g, wire-f-connects-to-g, wire-b-connects-to-g
Repeat such stanzas for all 10 possible digits for digit-0 and then repeat this for all the other nine digit-N. We produce pretty much the same stanzas for display-0(-is-one), just that we omit the second Provides & Conflicts from above (digit-is-one) as in the display digits can be repeated. The rest is the same (modulo using display instead of digit as name of course). Lastly we create a package dubbed solution which depends on all 10 digit-N and 4 display-N all of them virtual packages apt will have to choose an installable provider from and we are nearly done! The resulting Packages file2 we can give to apt while requesting to install the package solution and it will spit out not only the display values we are interested in but also which number each digit represents and which wire is connected to which segment. Nifty!
$ ./skip-aoc 'acedgfb cdfbe gcdfa fbcad dab cefabd cdfgeb eafb cagedb ab   cdfeb fcadb cdfeb cdbaf'
[ ]
The following additional packages will be installed:
  digit-0-is-eight digit-1-is-five digit-2-is-two digit-3-is-three
  digit-4-is-seven digit-5-is-nine digit-6-is-six digit-7-is-four
  digit-8-is-zero digit-9-is-one display-1-is-five display-2-is-three
  display-3-is-five display-4-is-three wire-a-connects-to-c
  wire-b-connects-to-f wire-c-connects-to-g wire-d-connects-to-a
  wire-e-connects-to-b wire-f-connects-to-d wire-g-connects-to-e
[ ]
0 upgraded, 22 newly installed, 0 to remove and 0 not upgraded.
We are only interested in the numbers on the display through, so grepping the apt output (-V is our friend here) a bit should let us end up with what we need as calculating3 is (unsurprisingly) not a strong suit of our package relationship language so we need a few shell commands to help us with the rest.
$ ./skip-aoc 'acedgfb cdfbe gcdfa fbcad dab cefabd cdfgeb eafb cagedb ab   cdfeb fcadb cdfeb cdbaf' -qq
5353
I have written the skip-aoc script as a testcase for apt, so to run it you need to place it in /path/to/source/of/apt/test/integration and built apt first, but that is only due to my laziness. We could write a standalone script interfacing with the system installed apt directly and in any apt version since ~2011. To hand in the solution for the puzzle we just need to run this on each line of the input (~200 lines) and add all numbers together. In other words: Behold this beautiful shell one-liner: parallel -I ' ' ./skip-aoc ' ' -qq < input.txt paste -s -d'+' - bc (You may want to run parallel with -P to properly grill your CPU as that process can take a while otherwise and it still does anyhow as I haven't optimized it at all the testing framework does a lot of pointless things wasting time here, but we aren't aiming for the leaderboard so ) That might or even likely will fail through as I have so far omitted a not unimportant detail: The default APT resolver is not able to solve this puzzle with the given problem description we need another solver! Thankfully that is as easy as installing apt-cudf (and with it aspcud) which the script is using via --solver aspcud to make apt hand over the puzzle to a "proper" solver (or better: A solver who is supposed to be good at "answering set" questions). The buildds are using this for experimental and/or backports builds and also for installability checks via dose3 btw, so you might have encountered it before. Be careful however: Just because aspcud can solve this puzzle doesn't mean it is a good default resolver for your day to day apt. One of the reasons the default resolver has such a hard time solving this here is that or-groups have usually an order in which the first is preferred over every later option and so fort. This is of no concern here as all these alternatives will collapse to a single solution anyhow, but if there are multiple viable solutions (which is often the case) picking the "wrong" alternative can have bad consequences. A classic example would be exim4 postfix nullmailer. They are all MTAs but behave very different. The non-default solvers also tend to lack certain features like keeping track of auto-installed packages or installing Recommends/Suggests. That said, Julian is working on another solver as I write this which might deal with more of these issues. And lastly: I am also relatively sure that with a bit of massaging the default resolver could be made to understand the problem, but I can't play all day with this maybe some other day. Disclaimer: Originally posted in the daily megathread on reddit, the version here is just slightly better understandable as I have hopefully renamed all the packages to have more conventional names and tried to explain what I am actually doing. No cows were harmed in this improved version, either.

  1. If you would upload those packages somewhere, it would be good style to add Replaces as well, but it is of minor concern for apt so I am leaving them out here for readability.
  2. We have generated 49 wires, 100 digits, 40 display and 1 solution package for a grant total of 190 packages. We are also making use of a few purely virtual ones, but that doesn't add up to many packages in total. So few packages are practically childs play for apt given it usually deals with thousand times more. The instability for those packages tends to be a lot better through as only 22 of 190 packages we generated can (and will) be installed. Britney will hate you if your uploads to Debian unstable are even remotely as bad as this.
  3. What we could do is introduce 10.000 packages which denote every possible display value from 0000 to 9999. We would then need to duplicate our 10.190 packages for each line (namespace them) and then add a bit more than a million packages with the correct dependencies for summing up the individual packages for apt to be able to display the final result all by itself. That would take a while through as at that point we are looking at working with ~22 million packages with a gazillion amount of dependencies probably overworking every solver we would throw at it a bit of shell glue seems the better option for now.
This article was written by David Kalnischkies on apt-get a life and republished here by pulling it from a syndication feed. You should check there for updates and more articles about apt and EDSP.

14 November 2021

Junichi Uekawa: I tried making a graph of deductions for my tax return.

I tried making a graph of deductions for my tax return. Mostly because I was curious how the curve looks like when income increases. It's nice that I get plotly to plot the graph in milliseconds. But the amount of required input makes it really hard to get right and I think something is still off. graph of income tax. It was interesting that it doesn't look like a curve, mostly a straight line.

8 November 2021

Enrico Zini: An educational debugging session

This morning we realised that a test case failed on Fedora 34 only (the link is in Italian) and we set to debugging. The initial analysis This is the initial reproducer:
$ PROJ_DEBUG=3 python setup.py test
test_recipe (tests.test_litota3.TestLITOTA3NordArkimetIFS) ... pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: Open of /lib64/../share/proj/proj.db failed
pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: no database context specified
Cannot instantiate source_crs
EXCEPTION in py_coast(): ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
ERROR
Note that opening /lib64/../share/proj/proj.db sometimes succeeds, sometimes fails. It's some kind of Schr dinger path, which works or not depending on how you observe it:
# ls -lad /lib64
lrwxrwxrwx 1 1000 1000 9 Jan 26  2021 /lib64 -> usr/lib64
$ ls -la /lib64/../share/proj/proj.db
-rw-r--r-- 1 root root 8925184 Jan 28  2021 /lib64/../share/proj/proj.db
$ cd /lib64/../share/proj/
$ cd /lib64
$ cd ..
$ cd share
-bash: cd: share: No such file or directory
And indeed, stat(2) finds it, and sqlite doesn't (the file is a sqlite database):
$ stat /lib64/../share/proj/proj.db
  File: /lib64/../share/proj/proj.db
  Size: 8925184     Blocks: 17432      IO Block: 4096   regular file
Device: 33h/51d Inode: 56907       Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-11-08 14:09:12.334350779 +0100
Modify: 2021-01-28 05:38:11.000000000 +0100
Change: 2021-11-08 13:42:51.758874327 +0100
 Birth: 2021-11-08 13:42:51.710874051 +0100
$ sqlite3 /lib64/../share/proj/proj.db
Error: unable to open database "/lib64/../share/proj/proj.db": unable to open database file
A minimal reproducer Later on we started stripping layers of code towards a minimal reproducer: here it is. It works or doesn't work depending on whether proj is linked explicitly, or via MagPlus:
$ cat tc.cc
#include <magics/ProjP.h>
int main()  
    magics::ProjP p("EPSG:4326", "+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs");
    return 0;
 
$ g++ -o tc  tc.cc -I/usr/include/magics  -lMagPlus
$ ./tc
proj_create: Open of /lib64/../share/proj/proj.db failed
proj_create: no database context specified
terminate called after throwing an instance of 'magics::MagicsException'
  what():  ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
Aborted (core dumped)
$ g++ -o tc  tc.cc -I/usr/include/magics -lproj  -lMagPlus
$ ./tc
What is going on here? A difference between the two is the path used to link to libproj.so:
$ ldd ./tc   grep proj
    libproj.so.19 => /lib64/libproj.so.19 (0x00007fd4919fb000)
$ g++ -o tc  tc.cc -I/usr/include/magics   -lMagPlus
$ ldd ./tc   grep proj
    libproj.so.19 => /lib64/../lib64/libproj.so.19 (0x00007f6d1051b000)
Common sense screams that this should not matter, but we chased an intuition and found that one of the ways proj looks for its database is relative to its shared library. Indeed, gdb in hand, that dladdr call returns /lib64/../lib64/libproj.so.19. From /lib64/../lib64/libproj.so.19, proj strips two paths from the end, presumably to pass from something like /something/usr/lib/libproj.so to /something/usr. So, dladdr returns /lib64/../lib64/libproj.so.19, which becomes /lib64/../, which becomes /lib64/../share/proj/proj.db, which exists on the file system and is used as a path to the database. But depending how you look at it, that path might or might not be valid: it passes the stat(2) check that stops the lookup for candidate paths, but sqlite is unable to open it. Why does the other path work? By linking libproj.so in the other way, dladdr returns /lib64/libproj.so.19, which becomes /share/proj/proj.db, which doesn't exist, which triggers a fallback to a PROJ_LIB constant defined at compile time, which is a path that works no matter how you look at it. Why that weird path with libMagPlus? To complete the picture, we found that libMagPlus.so is packaged with a rpath set, which is known to cause trouble
# readelf -d /usr/lib64/libMagPlus.so grep rpath
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../lib64]
The workaround We found that one can set PROJ_LIB in the environment to override the normal proj database lookup. Building on that, we came up with a simple way to override it on Fedora 34 only:
    if distro is not None and distro.linux_distribution()[:2] == ("Fedora", "34") and "PROJ_LIB" not in os.environ:
         self.env_overrides["PROJ_LIB"] = "/usr/share/proj/"
This has been a most edifying and educational debugging session, with only the necessary modicum of curses and swearwords. Working in a team of excellent people really helps.

4 November 2021

Antoine Beaupr : A Python contextmanager gotcha

Dear lazy web... I've had this code sitting around as a wtf.py for a while. I've been meaning to understand what's going on and write a blog post about it for a while, but I'm lacking the time. Now that I have a few minutes, I actually sat down to look at it and I think I figured it out:
from contextlib import contextmanager
@contextmanager
def bad():
    print(&aposin the context manager&apos)
    try:
        print("yielding value")
        yield &aposvalue&apos
    finally:
        return print(&aposcleaning up&apos)
@contextmanager
def good():
    print(&aposin the context manager&apos)
    try:
        print("yielding value")
        yield &aposvalue&apos
    finally:
        print(&aposcleaning up&apos)
with bad() as v:
    print(&aposgot v = %s&apos % v)
    raise Exception(&aposexception not raised!&apos)  # SILENCED!
print("this code is reached")
with good() as v:
    print(&aposgot v = %s&apos % v)
    raise Exception(&aposexpection normally raised&apos)
print("NOT REACHED (expected)")
For those, like me, who need a walkthrough, here's what the above does:
  1. define a bad context manager (the things you use with with statements) with contextlib.contextmanager) which:
    1. prints a debug statement
    2. return a value
    3. then returns and prints a debug statement
  2. define a good context manager in much the same way, except it doesn't return, it just prints statement
  3. use the bad context manager to show how it bypasses an exception
  4. use the good context manager to show how it correctly raises the exception
The output of this code (in Debian 11 bullseye, Python 3.9.2) is:
in the context manager
yielding value
got v = value
cleaning up
this code is reached
in the context manager
yielding value
got v = value
cleaning up
Traceback (most recent call last):
  File "/home/anarcat/wikis/anarc.at/wtf.py", line 31, in <module>
    raise Exception('expection normally raised')
Exception: expection normally raised
What is surprising to me, with this code, is not only does the exception not get raised, but also the return statement doesn't seem to actually execute, or at least not in the parent scope: if it would, this code is reached wouldn't be printed and the rest of the code wouldn't run either. So what's going on here? Now I know that I should be careful with return in my context manager, but why? And why is it silencing the exception? The reason why it's being silenced is this little chunk in the with documentation:
If the suite was exited due to an exception, and the return value from the exit() method was false, the exception is reraised. If the return value was true, the exception is suppressed, and execution continues with the statement following the with statement.
This feels a little too magic. If you write a context manager with __exit__(), you're kind of forced to lookup again what that API is. But the contextmanager decorator hides that away and it's easy to make that mistake... Credits to the Python tips book for teaching me about that trick in the first place.

18 October 2021

Dirk Eddelbuettel: dang 0.0.14: Several Updates

A new release of the dang package arrived at CRAN a couple of hours ago, exactly eight months after the previous release. The dang package regroups a few functions of mine that had no other home as for example lsos() from a StackOverflow question from 2009 (!!), the overbought/oversold price band plotter from an older blog post, the market monitor from the last release as well the checkCRANStatus() function recently tweeted about by Tim Taylor. This release regroups a few small edits to several functions, adds a sample function for character encoding reading and conversion using a library already used by R (hence look Ma, no new depends ), adds a weekday helper, and a sample usage (computing rolling min/max values) of a new simple vector class added to tidyCpp (and the function and class need to get another blog post or study ), and an experimental git sha1sum and date marker (as I am not the fan of autogenerated binaries from repos as opposed to marked released meaning: we may see different binary release with the same version number). The full NEWS entry follows.

Changes in version 0.0.14 (2021-10-17)
  • Updated continuous integration to run on Linux only.
  • Edited checkNonAscii.cpp for readability.
  • More robust title display in intradayMarketMonitor.R.
  • New C++-based function to read and convert encoding via the R-supplied iconv library, noted a potential variability.
  • New function wday returning day of the week as integer.
  • The signature to as.data.table was standardized.
  • A new function rollMinMax was added illustrating use of the NumVec class from tidyCpp.
  • The configure script can record the last commit date and sha1 to automate timestamping builds, but not activated in this release.
  • checkCRANStatus() now works correctly for single-package lookups (Jordan Mark Barbone in #4).

Courtesy of my CRANberries, there is a comparison to the previous release. For questions or comments use the issue tracker off the GitHub repo. If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

2 October 2021

Fran ois Marier: Setting up a JMP SIP account on Asterisk

JMP offers VoIP calling via XMPP, but it's also possibly to use the VoIP using SIP. The underlying VoIP calling functionality in JMP is provided by Bandwidth, but their old Asterisk instructions didn't quite work for me. Here's how I set it up in my Asterisk server.

Get your SIP credentials After signing up for JMP and setting it up in your favourite XMPP client, send the following message to the cheogram.com gateway contact:
reset sip account
In response, you will receive a message containing:
  • a numerical username
  • a password (e.g. three lowercase words separated by spaces)

Add SIP account to your Asterisk config First of all, I added the following near the top of my /etc/asterisk/sip.conf:
[general]
register => username:three secret words@jmp.cbcbc7.auth.bandwidth.com:5008
The other non-default options I have set in [general] are:
context=public
allowoverlap=no
udpbindaddr=0.0.0.0
tcpenable=yes
tcpbindaddr=0.0.0.0
tlsenable=yes
transport=udp
srvlookup=no
vmexten=voicemail
relaxdtmf=yes
useragent=Asterisk PBX
tlscertfile=/etc/asterisk/asterisk.cert
tlsprivatekey=/etc/asterisk/asterisk.key
tlscapath=/etc/ssl/certs/
externhost=machinename.dyndns.org
localnet=192.168.0.0/255.255.0.0
Note that you can have more than one register line in your config if you use more than one SIP provider, but you must register with the server whether you want to receive incoming calls or not. Then I added a new blurb to the bottom of the same file:
[jmp]
type=peer
host=mp.cbcbc7.auth.bandwidth.com
port=5008
secret=three secret words
defaultuser=username
context=from-jmp
disallow=all
allow=ulaw
allow=g729
insecure=port,invite
canreinvite=no
dtmfmode=rfc2833
and for reference, here's the blurb for my Snom 300 SIP phone:
[1001]
; Snom 300
type=friend
qualify=yes
secret=password
encryption=no
context=full
host=dynamic
nat=no
directmedia=no
mailbox=1000@internal
vmexten=707
dtmfmode=rfc2833
call-limit=2
disallow=all
allow=g722
allow=ulaw
I checked that the registration was successful by running asterisk -r and then typing:
sip set debug on
before reloading the configuration using:
reload

Create Asterisk extensions to send and receive calls Once I got registration to work, I hooked this up with my other extensions so that I could send and receive calls using my JMP number. In /etc/asterisk/extensions.conf, I added the following:
[from-jmp]
include => home
exten => s,1,Goto(1000,1)
where home is the context which includes my local SIP devices and 1000 is the extension I want to ring. Then I added the following to enable calls to any destination within the North American Numbering Plan:
[pstn-jmp]
exten => _1NXXNXXXXXX,1,Set(CALLERID(all)=Francois Marier <5551231434>)
exten => _1NXXNXXXXXX,n,Dial(SIP/jmp/$ EXTEN )
exten => _1NXXNXXXXXX,n,Hangup()
exten => _NXXNXXXXXX,1,Set(CALLERID(all)=Francois Marier <5551231234>)
exten => _NXXNXXXXXX,n,Dial(SIP/jmp/1$ EXTEN )
exten => _NXXNXXXXXX,n,Hangup()
Here 5551231234 is my JMP phone number, not my bwsip numerical username. For reference, here's the rest of my dialplan in /etc/asterisk/extensions.conf:
[general]
static=yes
writeprotect=no
clearglobalvars=no
[public]
exten => _X.,1,Hangup(3)
[sipdefault]
exten => _X.,1,Hangup(3)
[default]
exten => _X.,1,Hangup(3)
[internal]
include => home
[full]
include => internal
include => pstn-jmp
exten => 707,1,VoiceMailMain(1000@internal)
[home]
; Internal extensions
exten => 1000,1,Dial(SIP/1001,20)
exten => 1000,n,Goto(in1000-$ DIALSTATUS ,1)
exten => 1000,n,Hangup
exten => in1000-BUSY,1,Hangup(17)
exten => in1000-CONGESTION,1,Hangup(3)
exten => in1000-CHANUNAVAIL,1,VoiceMail(1000@internal,su)
exten => in1000-CHANUNAVAIL,n,Hangup(3)
exten => in1000-NOANSWER,1,VoiceMail(1000@internal,su)
exten => in1000-NOANSWER,n,Hangup(16)
exten => _in1000-.,1,Hangup(16)

Firewall Finally, I opened a few ports in my firewall by putting the following in /etc/network/iptables.up.rules:
# SIP and RTP on UDP (jmp.cbcbc7.auth.bandwidth.com)
-A INPUT -s 67.231.2.13/32 -p udp --dport 5008 -j ACCEPT
-A INPUT -s 216.82.238.135/32 -p udp --dport 5008 -j ACCEPT
-A INPUT -s 67.231.2.13/32 -p udp --sport 5004:5005 --dport 10001:20000 -j ACCEPT
-A INPUT -s 216.82.238.135/32 -p udp --sport 5004:5005 --dport 10001:20000 -j ACCEPT

Outbound calls not working While the above setup works for me for inbound calls, it doesn't currently work for outbound calls. The hostname currently resolves to one of two IP addresses:
$ dig +short jmp.cbcbc7.auth.bandwidth.com
67.231.2.13
216.82.238.135
If I pin it to the first one by putting the following in my /etc/hosts file:
67.231.2.13 jmp.cbcbc7.auth.bandwidth.com
then I get a 486 error back from the server when I dial 1-555-456-4567:
<--- SIP read from UDP:67.231.2.13:5008 --->
SIP/2.0 486 Busy Here
Via: SIP/2.0/UDP 127.0.0.1:5060;branch=z9hG4bK03210a30
From: "Francois Marier" <sip:5551231234@127.0.0.1>
To: <sip:15554564567@jmp.cbcbc7.auth.bandwidth.com:5008>
Call-ID: 575f21f36f57951638c1a8062f3a5201@127.0.0.1:5060
CSeq: 103 INVITE
Content-Length: 0
On the other hand, if I pin it to 216.82.238.135, then I get a 600 error:
<--- SIP read from UDP:216.82.238.135:5008 --->
SIP/2.0 600 Busy Everywhere
Via: SIP/2.0/UDP 127.0.0.1:5060;branch=z9hG4bK7b7f7ed9
From: "Francois Marier" <sip:5551231234@127.0.0.1>
To: <sip:15554564567@jmp.cbcbc7.auth.bandwidth.com:5008>
Call-ID: 5bebb8d05902c1732c6b9f4776844c66@127.0.0.1:5060
CSeq: 103 INVITE
Content-Length: 0
If you have any idea what might be wrong here, or if you got outbound calls to work on Bandwidth.com, please leave a comment!

6 September 2021

Vincent Bernat: Switching to the i3 window manager

I have been using the awesome window manager for 10 years. It is a tiling window manager, configurable and extendable with the Lua language. Using a general-purpose programming language to configure every aspect is a double-edged sword. Due to laziness and the apparent difficulty of adapting my configuration about 3000 lines to newer releases, I was stuck with the 3.4 version, whose last release is from 2013. It was time for a rewrite. Instead, I have switched to the i3 window manager, lured by the possibility to migrate to Wayland and Sway later with minimal pain. Using an embedded interpreter for configuration is not as important to me as it was in the past: it brings both complexity and brittleness.
i3 dual screen setup
Dual screen desktop running i3, Emacs, some terminals, including a Quake console, Firefox, Polybar as the status bar, and Dunst as the notification daemon.
The window manager is only one part of a desktop environment. There are several options for the other components. I am also introducing them in this post.

i3: the window manager i3 aims to be a minimal tiling window manager. Its documentation can be read from top to bottom in less than an hour. i3 organize windows in a tree. Each non-leaf node contains one or several windows and has an orientation and a layout. This information arbitrates the window positions. i3 features three layouts: split, stacking, and tabbed. They are demonstrated in the below screenshot:
Example of layouts
Demonstration of the layouts available in i3. The main container is split horizontally. The first child is split vertically. The second one is tabbed. The last one is stacking.
Tree representation of the previous screenshot
Tree representation of the previous screenshot.
Most of the other tiling window managers, including the awesome window manager, use predefined layouts. They usually feature a large area for the main window and another area divided among the remaining windows. These layouts can be tuned a bit, but you mostly stick to a couple of them. When a new window is added, the behavior is quite predictable. Moreover, you can cycle through the various windows without thinking too much as they are ordered. i3 is more flexible with its ability to build any layout on the fly, it can feel quite overwhelming as you need to visualize the tree in your head. At first, it is not unusual to find yourself with a complex tree with many useless nested containers. Moreover, you have to navigate windows using directions. It takes some time to get used to. I set up a split layout for Emacs and a few terminals, but most of the other workspaces are using a tabbed layout. I don t use the stacking layout. You can find many scripts trying to emulate other tiling window managers but I did try to get my setup pristine of these tentatives and get a chance to familiarize myself. i3 can also save and restore layouts, which is quite a powerful feature. My configuration is quite similar to the default one and has less than 200 lines.

i3 companion: the missing bits i3 philosophy is to keep a minimal core and let the user implements missing features using the IPC protocol:
Do not add further complexity when it can be avoided. We are generally happy with the feature set of i3 and instead focus on fixing bugs and maintaining it for stability. New features will therefore only be considered if the benefit outweighs the additional complexity, and we encourage users to implement features using the IPC whenever possible. Introduction to the i3 window manager
While this is not as powerful as an embedded language, it is enough for many cases. Moreover, as high-level features may be opinionated, delegating them to small, loosely coupled pieces of code keeps them more maintainable. Libraries exist for this purpose in several languages. Users have published many scripts to extend i3: automatic layout and window promotion to mimic the behavior of other tiling window managers, window swallowing to put a new app on top of the terminal launching it, and cycling between windows with Alt+Tab. Instead of maintaining a script for each feature, I have centralized everything into a single Python process, i3-companion using asyncio and the i3ipc-python library. Each feature is self-contained into a function. It implements the following components:
make a workspace exclusive to an application
When a workspace contains Emacs or Firefox, I would like other applications to move to another workspace, except for the terminal which is allowed to intrude into any workspace. The workspace_exclusive() function monitors new windows and moves them if needed to an empty workspace or to one with the same application already running.
implement a Quake console
The quake_console() function implements a drop-down console available from any workspace. It can be toggled with Mod+ . This is implemented as a scratchpad window.
back and forth workspace switching on the same output
With the workspace back_and_forth command, we can ask i3 to switch to the previous workspace. However, this feature is not restricted to the current output. I prefer to have one keybinding to switch to the workspace on the next output and one keybinding to switch to the previous workspace on the same output. This behavior is implemented in the previous_workspace() function by keeping a per-output history of the focused workspaces.
create a new empty workspace or move a window to an empty workspace
To create a new empty workspace or move a window to an empty workspace, you have to locate a free slot and use workspace number 4 or move container to workspace number 4. The new_workspace() function finds a free number and use it as the target workspace.
restart some services on output change
When adding or removing an output, some actions need to be executed: refresh the wallpaper, restart some components unable to adapt their configuration on their own, etc. i3 triggers an event for this purpose. The output_update() function also takes an extra step to coalesce multiple consecutive events and to check if there is a real change with the low-level library xcffib.
I will detail the other features as this post goes on. On the technical side, each function is decorated with the events it should react to:
@on(CommandEvent("previous-workspace"), I3Event.WORKSPACE_FOCUS)
async def previous_workspace(i3, event):
    """Go to previous workspace on the same output."""
The CommandEvent() event class is my way to send a command to the companion, using either i3-msg -t send_tick or binding a key to a nop command. The latter is used to avoid spawning a shell and a i3-msg process just to send a message. The companion listens to binding events and checks if this is a nop command.
bindsym $mod+Tab nop "previous-workspace"
There are other decorators to avoid code duplication: @debounce() to coalesce multiple consecutive calls, @static() to define a static variable, and @retry() to retry a function on failure. The whole script is a bit more than 1000 lines. I think this is worth a read as I am quite happy with the result.

dunst: the notification daemon Unlike the awesome window manager, i3 does not come with a built-in notification system. Dunst is a lightweight notification daemon. I am running a modified version with HiDPI support for X11 and recursive icon lookup. The i3 companion has a helper function, notify(), to send notifications using DBus. container_info() and workspace_info() uses it to display information about the container or the tree for a workspace.
Notification showing i3 tree for a workspace
Notification showing i3 s tree for a workspace

polybar: the status bar i3 bundles i3bar, a versatile status bar, but I have opted for Polybar. A wrapper script runs one instance for each monitor. The first module is the built-in support for i3 workspaces. To not have to remember which application is running in a workspace, the i3 companion renames workspaces to include an icon for each application. This is done in the workspace_rename() function. The icons are from the Font Awesome project. I maintain a mapping between applications and icons. This is a bit cumbersome but it looks great.
i3 workspaces in Polybar
i3 workspaces in Polybar
For CPU, memory, brightness, battery, disk, and audio volume, I am relying on the built-in modules. Polybar s wrapper script generates the list of filesystems to monitor and they get only displayed when available space is low. The battery widget turns red and blinks slowly when running out of power. Check my Polybar configuration for more details.
Various modules for Polybar
Polybar displaying various information: CPU usage, memory usage, screen brightness, battery status, Bluetooth status (with a connected headset), network status (connected to a wireless network and to a VPN), notification status, and speaker volume.
For Bluetooh, network, and notification statuses, I am using Polybar s ipc module: the next version of Polybar can receive an arbitrary text on an IPC socket. The module is defined with a single hook to be executed at the start to restore the latest status.
[module/network]
type = custom/ipc
hook-0 = cat $XDG_RUNTIME_DIR/i3/network.txt 2> /dev/null
initial = 1
It can be updated with polybar-msg action "#network.send.XXXX". In the i3 companion, the @polybar() decorator takes the string returned by a function and pushes the update through the IPC socket. The i3 companion reacts to DBus signals to update the Bluetooth and network icons. The @on() decorator accepts a DBusSignal() object:
@on(
    StartEvent,
    DBusSignal(
        path="/org/bluez",
        interface="org.freedesktop.DBus.Properties",
        member="PropertiesChanged",
        signature="sa sv as",
        onlyif=lambda args: (
            args[0] == "org.bluez.Device1"
            and "Connected" in args[1]
            or args[0] == "org.bluez.Adapter1"
            and "Powered" in args[1]
        ),
    ),
)
@retry(2)
@debounce(0.2)
@polybar("bluetooth")
async def bluetooth_status(i3, event, *args):
    """Update bluetooth status for Polybar."""
The middle of the bar is occupied by the date and a weather forecast. The latest also uses the IPC mechanism, but the source is a Python script triggered by a timer.
Date and weather in Polybar
Current date and weather forecast for the day in Polybar. The data is retrieved with the OpenWeather API.
I don t use the system tray integrated with Polybar. The embedded icons usually look horrible and they all behave differently. A few years back, Gnome has removed the system tray. Most of the problems are fixed by the DBus-based Status Notifier Item protocol also known as Application Indicators or Ayatana Indicators for GNOME. However, Polybar does not support this protocol. In the i3 companion, The implementation of Bluetooth and network icons, including displaying notifications on change, takes about 200 lines. I got to learn a bit about how DBus works and I get exactly the info I want.

picom: the compositor I like having slightly transparent backgrounds for terminals and to reduce the opacity of unfocused windows. This requires a compositor.1 picom is a lightweight compositor. It works well for me, but it may need some tweaking depending on your graphic card.2 Unlike the awesome window manager, i3 does not handle transparency, so the compositor needs to decide by itself the opacity of each window. Check my configuration for details.

systemd: the service manager I use systemd to start i3 and the various services around it. My xsession script only sets some environment variables and lets systemd handles everything else. Have a look at this article from Micha G ral for the rationale. Notably, each component can be easily restarted and their logs are not mangled inside the ~/.xsession-errors file.3 I am using a two-stage setup: i3.service depends on xsession.target to start services before i3:
[Unit]
Description=X session
BindsTo=graphical-session.target
Wants=autorandr.service
Wants=dunst.socket
Wants=inputplug.service
Wants=picom.service
Wants=pulseaudio.socket
Wants=policykit-agent.service
Wants=redshift.service
Wants=spotify-clean.timer
Wants=ssh-agent.service
Wants=xiccd.service
Wants=xsettingsd.service
Wants=xss-lock.service
Then, i3 executes the second stage by invoking the i3-session.target:
[Unit]
Description=i3 session
BindsTo=graphical-session.target
Wants=wallpaper.service
Wants=wallpaper.timer
Wants=polybar-weather.service
Wants=polybar-weather.timer
Wants=polybar.service
Wants=i3-companion.service
Wants=misc-x.service
Have a look on my configuration files for more details.

rofi: the application launcher Rofi is an application launcher. Its appearance can be customized through a CSS-like language and it comes with several themes. Have a look at my configuration for mine.
Rofi as an application launcher
Rofi as an application launcher
It can also act as a generic menu application. I have a script to control a media player and another one to select the wifi network. It is quite a flexible application.
Rofi as a wifi network selector
Rofi to select a wireless network

xss-lock and i3lock: the screen locker i3lock is a simple screen locker. xss-lock invokes it reliably on inactivity or before a system suspend. For inactivity, it uses the XScreenSaver events. The delay is configured using the xset s command. The locker can be invoked immediately with xset s activate. X11 applications know how to prevent the screen saver from running. I have also developed a small dimmer application that is executed 20 seconds before the locker to give me a chance to move the mouse if I am not away.4 Have a look at my configuration script.
Demonstration of xss-lock, xss-dimmer and i3lock with a 4 speedup.

The remaining components
  • autorandr is a tool to detect the connected display, match them against a set of profiles, and configure them with xrandr.
  • inputplug executes a script for each new mouse and keyboard plugged. This is quite useful to load the appropriate the keyboard map. See my configuration.
  • xsettingsd provides settings to X11 applications, not unlike xrdb but it notifies applications for changes. The main use is to configure the Gtk and DPI settings. See my article on HiDPI support on Linux with X11.
  • Redshift adjusts the color temperature of the screen according to the time of day.
  • maim is a utility to take screenshots. I use Prt Scn to trigger a screenshot of a window or a specific area and Mod+Prt Scn to capture the whole desktop to a file. Check the helper script for details.
  • I have a collection of wallpapers I rotate every hour. A script selects them using advanced machine learning algorithms and stitches them together on multi-screen setups. The selected wallpaper is reused by i3lock.

  1. Apart from the eye candy, a compositor also helps to get tear-free video playbacks.
  2. My configuration works with both Haswell (2014) and Whiskey Lake (2018) Intel GPUs. It also works with AMD GPU based on the Polaris chipset (2017).
  3. You cannot manage two different displays this way e.g. :0 and :1. In the first implementation, I did try to parametrize each service with the associated display, but this is useless: there is only one DBus user session and many services rely on it. For example, you cannot run two notification daemons.
  4. I have only discovered later that XSecureLock ships such a dimmer with a similar implementation. But mine has a cool countdown!

29 June 2021

Enrico Zini: Building a Transilience playbook in a zipapp

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. Mitogen is a great library, but scarily complicated, and I've been wondering how hard it would be to make alternative connection methods for Transilience. Here's a wild idea: can I package a whole Transilience playbook, plus dependencies, in a zipapp, then send the zipapp to the machine to be provisioned, and run it locally? It turns out I can. Creating the zipapp This is somewhat hackish, but until I can rely on Python 3.9's improved importlib.resources module, I cannot think of a better way:
    def zipapp(self, target: str, interpreter=None):
        """
        Bundle this playbook into a self-contained zipapp
        """
        import zipapp
        import jinja2
        import transilience
        if interpreter is None:
            interpreter = sys.executable
        if getattr(transilience.__loader__, "archive", None):
            # Recursively iterating module directories requires Python 3.9+
            raise NotImplementedError("Cannot currently create a zipapp from a zipapp")
        with tempfile.TemporaryDirectory() as workdir:
            # Copy transilience
            shutil.copytree(os.path.dirname(__file__), os.path.join(workdir, "transilience"))
            # Copy jinja2
            shutil.copytree(os.path.dirname(jinja2.__file__), os.path.join(workdir, "jinja2"))
            # Copy argv[0] as __main__.py
            shutil.copy(sys.argv[0], os.path.join(workdir, "__main__.py"))
            # Copy argv[0]/roles
            role_dir = os.path.join(os.path.dirname(sys.argv[0]), "roles")
            if os.path.isdir(role_dir):
                shutil.copytree(role_dir, os.path.join(workdir, "roles"))
            # Turn everything into a zipapp
            zipapp.create_archive(workdir, target, interpreter=interpreter, compressed=True)
Since the zipapp contains not just the playbook, the roles, and the roles' assets, but also Transilience and Jinja2, it can run on any system that has a Python 3.7+ interpreter, and nothing else! I added it to the standard set of playbook command line options, so any Transilience playbook can turn itself into a self-contained zipapp:
$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--local LOCAL]
                 [--ansible-to-python role   --ansible-to-ast role   --zipapp file.pyz]
[...]
  --zipapp file.pyz     bundle this playbook in a self-contained executable
                        python zipapp
Loading assets from the zipapp I had to create ZipFile varieties of some bits of infrastructure in Transilience, to load templates, files, and Ansible yaml files from zip files. You can see above a way to detect if a module is loaded from a zipfile: check if the module's __loader__ attribute has an archive attribute. Here's a Jinja2 template loader that looks into a zip:
class ZipLoader(jinja2.BaseLoader):
    def __init__(self, archive: zipfile.ZipFile, root: str):
        self.zipfile = archive
        self.root = root
    def get_source(self, environment: jinja2.Environment, template: str):
        path = os.path.join(self.root, template)
        with self.zipfile.open(path, "r") as fd:
            source = fd.read().decode()
        return source, None, lambda: True
I also created a FileAsset abstract interface to represent a local file, and had Role.lookup_file return an appropriate instance:
    def lookup_file(self, path: str) -> str:
        """
        Resolve a pathname inside the place where the role assets are stored.
        Returns a pathname to the file
        """
        if self.role_assets_zipfile is not None:
            return ZipFileAsset(self.role_assets_zipfile, os.path.join(self.role_assets_root, path))
        else:
            return LocalFileAsset(os.path.join(self.role_assets_root, path))
An interesting side effect of having smarter local file accessors is that I can cache the contents of small files and transmit them to the remote host together with the other action parameters, saving a potential network round trip for each builtin.copy action that has a small source. The result The result is kind of fun:
$ time ./provision --zipapp test.pyz
real    0m0.203s
user    0m0.174s
sys 0m0.029s
$ time scp test.pyz root@test:
test.pyz                                                                                                         100%  528KB 388.9KB/s   00:01
real    0m1.576s
user    0m0.010s
sys 0m0.007s
And on the remote:
# time ./test.pyz --local=test
2021-06-29 18:05:41,546 test: [connected 0.000s]
[...]
2021-06-29 18:12:31,555 test: 88 total actions in 0.00ms: 87 unchanged, 0 changed, 1 skipped, 0 failed, 0 not executed.
real    0m0.979s
user    0m0.783s
sys 0m0.172s
Compare with a Mitogen run:
$ time PYTHONPATH=../transilience/ ./provision
2021-06-29 18:13:44 test: [connected 0.427s]
[...]
2021-06-29 18:13:46 test: 88 total actions in 2.50s: 87 unchanged, 0 changed, 1 skipped, 0 failed, 0 not executed.
real    0m2.697s
user    0m0.856s
sys 0m0.042s
From a single test run, not a good benchmark, it's 0.203 + 1.576 + 0.979 = 2.758s with the zipapp and 2.697s with Mitogen. Even if I've been lucky, it's a similar order of magnitude. What can I use this for? This was mostly a fun hack. It could however be the basis for a Fabric-based connector, or a clusterssh-based connector, or for bundling a Transilience playbook into an installation image, or to add a provisioning script to the boot partition of a Raspberry Pi. It looks like an interesting trick to have up one's sleeve. One could even build an Ansible-based connector(!) in which a simple Ansible playbook, with no facts gathering, is used to build the zipapp, push it to remote systems and run it. That would be the wackiest way of speeding up Ansible, ever! Next: using Systemd containers with unittest, for Transilience's test suite.

4 June 2021

Matthew Garrett: Mike Lindell's Cyber "Evidence"

Mike Lindell, notable for absolutely nothing relevant in this field, today filed a lawsuit against a couple of voting machine manufacturers in response to them suing him for defamation after he claimed that they were covering up hacks that had altered the course of the US election. Paragraph 104 of his suit asserts that he has evidence of at least 20 documented hacks, including the number of votes that were changed. The citation is just a link to a video called Absolute 9-0, which claims to present sufficient evidence that the US supreme court will come to a 9-0 decision that the election was tampered with.

The claim is that Lindell was provided with a set of files on the 9th of January, and gave these to some cyber experts to verify. These experts identified them as packet captures. The video contains scrolling hex, and we are told that this is the raw encrypted data from the files. In reality, the hex values correspond very clearly to printable ASCII, and appear to just be the Pennsylvania voter roll. They're not encrypted, and they're not packet captures (they contain no packet headers).

20 of these packet captures were then selected and analysed, giving us the tables contained within Exhibit 12. The alleged source IPs appear to correspond to the networks the tables claim, and the latitude and longitude presumably just come from a geoip lookup of some sort (although clearly those values are far too precise to be accurate). But if we look at the target IPs, we find something interesting. Most of them resolve to the website for the county that was the nominal target (eg, 198.108.253.104 is www.deltacountymi.org). So, we're supposed to believe that in many cases, the county voting infrastructure was hosted on the county website.

Unfortunately we're not given the destination port, but 198.108.253.104 isn't listening on anything other than 80 and 443. We're told that the packet data is encrypted, so presumably it's over HTTPS. So, uh, how did they decrypt this to figure out how many votes were switched? If Mike's hackers have broken TLS, they really don't need to be dealing with this.

We're also given some background information on how it's impossible to reconstruct packet captures after the fact (untrue), or that modifying them would change their hashes (true, but in the absence of known good hash values that tells us nothing), but it's pretty clear that nothing we're shown actually demonstrates what we're told it does.

In summary: yes, any supreme court decision on this would be 9-0, just not the way he's hoping for.

Update: It was pointed out that this data appears to be part of a larger dataset. This one is even more dubious - it somehow has MAC addresses for both the source and destination (which is impossible), and almost none of these addresses are in actual issued ranges.

comment count unavailable comments

25 May 2021

Vincent Bernat: Jerikan+Ansible: a configuration management system for network

There are many resources for network automation with Ansible. Most of them only expose the first steps or limit themselves to a narrow scope. They give no clue on how to expand from that. Real network environments may be large, versatile, heterogeneous, and filled with exceptions. The lack of real-world examples for Ansible deployments, unlike Puppet and SaltStack, leads many teams to build brittle and incomplete automation solutions. We have released under an open-source license our attempt to tackle this problem:
  • Jerikan, a tool to build configuration files from a single source of truth and Jinja2 templates, along with its integration into the GitLab CI system;
  • an Ansible playbook to deploy these configuration files on network devices; and
  • the configuration data and the templates for our, now defunct, datacenters in San Francisco and South Korea, covering many vendors (Facebook Wedge 100, Dell S4048 and S6010, Juniper QFX 5110, Juniper QFX 10002, Cisco ASR 9001, Cisco Catalyst 2960, Opengear console servers, and Linux), and many functionalities (provisioning, BGP-to-the-host routing, edge routing, out-of-band network, DNS configuration, integration with NetBox and IRRs).
Here is a quick demo to configure a new peering:
This work is the collective effort of C dric Hasco t, Jean-Christophe Legatte, Lo c Pailhas, S bastien Hurtel, Tchadel Icard, and Vincent Bernat. We are the network team of Blade, a French company operating Shadow, a cloud-computing product. In May 2021, our company was bought by Octave Klaba and the infrastructure is being transferred to OVHcloud, saving Shadow as a product, but making our team redundant. Our network was around 800 devices, spanning over 10 datacenters with more than 2.5 Tbps of available egress bandwidth. The released material is therefore a substantial example of managing a medium-scale network using Ansible. We have left out the handling of our legacy datacenters to make the final result more readable while keeping enough material to not turn it into a trivial example.

Jerikan The first component is Jerikan. As input, it takes a list of devices, configuration data, templates, and validation scripts. It generates a set of configuration files for each device. Ansible could cover this task, but it has the following limitations:
  • it is slow;
  • errors are difficult to debug;1 and
  • the hierarchy to look up a variable is rigid.
Jerikan inputs and outputs
Jerikan inputs and outputs
If you want to follow the examples, you only need to have Docker and Docker Compose installed. Clone the repository and you are ready!

Source of truth We use YAML files, versioned with Git, as the single source of truth instead of using a database, like NetBox, or a mix of a database and text files. This provides many advantages:
  • anyone can use their preferred text editor;
  • the team prepares changes in branches;
  • the team reviews changes using merge requests;
  • the merge requests expose the changes to the generated configuration files;
  • rollback to a previous state is easy; and
  • it is fast.
The first file is devices.yaml. It contains the device list. The second file is classifier.yaml. It defines a scope for each device. A scope is a set of keys and values. It is used in templates and to look up data associated with a device.
$ ./run-jerikan scope to1-p1.sk1.blade-group.net
continent: apac
environment: prod
groups:
- tor
- tor-bgp
- tor-bgp-compute
host: to1-p1.sk1
location: sk1
member: '1'
model: dell-s4048
os: cumulus
pod: '1'
shorthost: to1-p1
The device name is matched against a list of regular expressions and the scope is extended by the result of each match. For to1-p1.sk1.blade-group.net, the following subset of classifier.yaml defines its scope:
matchers:
  - '^(([^.]*)\..*)\.blade-group\.net':
      environment: prod
      host: '\1'
      shorthost: '\2'
  - '\.(sk1)\.':
      location: '\1'
      continent: apac
  - '^to([12])-[as]?p(\d+)\.':
      member: '\1'
      pod: '\2'
  - '^to[12]-p\d+\.':
      groups:
        - tor
        - tor-bgp
        - tor-bgp-compute
  - '^to[12]-(p ap)\d+\.sk1\.':
      os: cumulus
      model: dell-s4048
The third file is searchpaths.py. It describes which directories to search for a variable. A Python function provides a list of paths to look up in data/ for a given scope. Here is a simplified version:2
def searchpaths(scope):
    paths = [
        "host/ scope[location] / scope[shorthost] ",
        "location/ scope[location] ",
        "os/ scope[os] - scope[model] ",
        "os/ scope[os] ",
        'common'
    ]
    for idx in range(len(paths)):
        try:
            paths[idx] = paths[idx].format(scope=scope)
        except KeyError:
            paths[idx] = None
    return [path for path in paths if path]
With this definition, the data for to1-p1.sk1.blade-group.net is looked up in the following paths:
$ ./run-jerikan scope to1-p1.sk1.blade-group.net
[ ]
Search paths:
  host/sk1/to1-p1
  location/sk1
  os/cumulus-dell-s4048
  os/cumulus
  common
Variables are scoped using a namespace that should be specified when doing a lookup. We use the following ones:
  • system for accounts, DNS, syslog servers,
  • topology for ports, interfaces, IP addresses, subnets,
  • bgp for BGP configuration
  • build for templates and validation scripts
  • apps for application variables
When looking up for a variable in a given namespace, Jerikan looks for a YAML file named after the namespace in each directory in the search paths. For example, if we look up a variable for to1-p1.sk1.blade-group.net in the bgp namespace, the following YAML files are processed: host/sk1/to1-p1/bgp.yaml, location/sk1/bgp.yaml, os/cumulus-dell-s4048/bgp.yaml, os/cumulus/bgp.yaml, and common/bgp.yaml. The search stops at the first match. The schema.yaml file allows us to override this behavior by asking to merge dictionaries and arrays across all matching files. Here is an excerpt of this file for the topology namespace:
system:
  users:
    merge: hash
  sampling:
    merge: hash
  ansible-vars:
    merge: hash
  netbox:
    merge: hash
The last feature of the source of truth is the ability to use Jinja2 templates for keys and values by prefixing them with ~ :
# In data/os/junos/system.yaml
netbox:
  manufacturer: Juniper
  model: "~  model upper  "
# In data/groups/tor-bgp-compute/system.yaml
netbox:
  role: net_tor_gpu_switch
Looking up for netbox in the system namespace for to1-p2.ussfo03.blade-group.net yields the following result:
$ ./run-jerikan scope to1-p2.ussfo03.blade-group.net
continent: us
environment: prod
groups:
- tor
- tor-bgp
- tor-bgp-compute
host: to1-p2.ussfo03
location: ussfo03
member: '1'
model: qfx5110-48s
os: junos
pod: '2'
shorthost: to1-p2
[ ]
Search paths:
[ ]
  groups/tor-bgp-compute
[ ]
  os/junos
  common
$ ./run-jerikan lookup to1-p2.ussfo03.blade-group.net system netbox
manufacturer: Juniper
model: QFX5110-48S
role: net_tor_gpu_switch
This also works for structured data:
# In groups/adm-gateway/topology.yaml
interface-rescue:
  address: "~  lookup('topology', 'addresses').rescue  "
  up:
    - "~ip route add default via   lookup('topology', 'addresses').rescue ipaddr('first_usable')   table rescue"
    - "~ip rule add from   lookup('topology', 'addresses').rescue ipaddr('address')   table rescue priority 10"
# In groups/adm-gateway-sk1/topology.yaml
interfaces:
  ens1f0: "~  lookup('topology', 'interface-rescue')  "
This yields the following result:
$ ./run-jerikan lookup gateway1.sk1.blade-group.net topology interfaces
[ ]
ens1f0:
  address: 121.78.242.10/29
  up:
  - ip route add default via 121.78.242.9 table rescue
  - ip rule add from 121.78.242.10 table rescue priority 10
When putting data in the source of truth, we use the following rules:
  1. Don t repeat yourself.
  2. Put the data in the most specific place without breaking the first rule.
  3. Use templates with parsimony, mostly to help with the previous rules.
  4. Restrict the data model to what is needed for your use case.
The first rule is important. For example, when specifying IP addresses for a point-to-point link, only specify one side and deduce the other value in the templates. The last rule means you do not need to mimic a BGP YANG model to specify BGP peers and policies:
peers:
  transit:
    cogent:
      asn: 174
      remote:
        - 38.140.30.233
        - 2001:550:2:B::1F9:1
      specific-import:
        - name: ATT-US
          as-path: ".*7018$"
          lp-delta: 50
  ix-sfmix:
    rs-sfmix:
      monitored: true
      asn: 63055
      remote:
        - 206.197.187.253
        - 206.197.187.254
        - 2001:504:30::ba06:3055:1
        - 2001:504:30::ba06:3055:2
    blizzard:
      asn: 57976
      remote:
        - 206.197.187.42
        - 2001:504:30::ba05:7976:1
      irr: AS-BLIZZARD

Templates The list of templates to compile for each device is stored in the source of truth, under the build namespace:
$ ./run-jerikan lookup edge1.ussfo03.blade-group.net build templates
data.yaml: data.j2
config.txt: junos/main.j2
config-base.txt: junos/base.j2
config-irr.txt: junos/irr.j2
$ ./run-jerikan lookup to1-p1.ussfo03.blade-group.net build templates
data.yaml: data.j2
config.txt: cumulus/main.j2
frr.conf: cumulus/frr.j2
interfaces.conf: cumulus/interfaces.j2
ports.conf: cumulus/ports.j2
dhcpd.conf: cumulus/dhcp.j2
default-isc-dhcp: cumulus/default-isc-dhcp.j2
authorized_keys: cumulus/authorized-keys.j2
motd: linux/motd.j2
acl.rules: cumulus/acl.j2
rsyslog.conf: cumulus/rsyslog.conf.j2
Templates are using Jinja2. This is the same engine used in Ansible. Jerikan ships some custom filters but also reuse some of the useful filters from Ansible, notably ipaddr. Here is an excerpt of templates/junos/base.j2 to configure DNS and NTP servers on Juniper devices:
system  
  ntp  
 % for ntp in lookup("system", "ntp") % 
    server   ntp  ;
 % endfor % 
   
  name-server  
 % for dns in lookup("system", "dns") % 
      dns  ;
 % endfor % 
   
 
The equivalent template for Cisco IOS-XR is:
 % for dns in lookup('system', 'dns') % 
domain vrf VRF-MANAGEMENT name-server   dns  
 % endfor % 
!
 % for syslog in lookup('system', 'syslog') % 
logging   syslog   vrf VRF-MANAGEMENT
 % endfor % 
!
There are three helper functions provided:
  • devices() returns the list of devices matching a set of conditions on the scope. For example, devices("location==ussfo03", "groups==tor-bgp") returns the list of devices in San Francisco in the tor-bgp group. You can also omit the operator if you want the specified value to be equal to the one in the local scope. For example, devices("location") returns devices in the current location.
  • lookup() does a key lookup. It takes the namespace, the key, and optionally, a device name. If not provided, the current device is assumed.
  • scope() returns the scope of the provided device.
Here is how you would define iBGP sessions between edge devices in the same location:
 % for neighbor in devices("location", "groups==edge") if neighbor != device % 
   % for address in lookup("topology", "addresses", neighbor).loopback tolist % 
protocols bgp group IPV  address ipv  -EDGES-IBGP  
  neighbor   address    
    description "IPv  address ipv  : iBGP to   neighbor  ";
   
 
   % endfor % 
 % endfor % 
We also have a global key-value store to save information to be reused in another template or device. This is quite useful to automatically build DNS records. First, capture the IP address inserted into a template with store() as a filter:
interface Loopback0
 description 'Loopback:'
  % for address in lookup('topology', 'addresses').loopback tolist % 
 ipv  address ipv   address   address store('addresses', 'Loopback0') ipaddr('cidr')  
  % endfor % 
!
Then, reuse it later to build DNS records by iterating over store():4
 % for device, ip, interface in store('addresses') % 
   % set interface = interface replace('/', '-') replace('.', '-') replace(':', '-') % 
   % set name = ' . '.format(interface lower, device) % 
  name  . IN   'A' if ip ipv4 else 'AAAA'     ip ipaddr('address')  
 % endfor % 
Templates are compiled locally with ./run-jerikan build. The --limit argument restricts the devices to generate configuration files for. Build is not done in parallel because a template may depend on the data collected by another template. Currently, it takes 1 minute to compile around 3000 files spanning over 800 devices.
Jerikan outputs when building templates
Output of Jerikan after building configuration files for six devices
When an error occurs, a detailed traceback is displayed, including the template name, the line number and the value of all visible variables. This is a major time-saver compared to Ansible!
templates/opengear/config.j2:15: in top-level template code
    config.interfaces.  interface  .netmask   adddress   ipaddr("netmask")  
        continent  = 'us'
        device     = 'con1-ag2.ussfo03.blade-group.net'
        environment = 'prod'
        host       = 'con1-ag2.ussfo03'
        infos      =  'address': '172.30.24.19/21' 
        interface  = 'wan'
        location   = 'ussfo03'
        loop       = <LoopContext 1/2>
        member     = '2'
        model      = 'cm7132-2-dac'
        os         = 'opengear'
        shorthost  = 'con1-ag2'
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
value = JerkianUndefined, query = 'netmask', version = False, alias = 'ipaddr'
[ ]
        # Check if value is a list and parse each element
        if isinstance(value, (list, tuple, types.GeneratorType)):
            _ret = [ipaddr(element, str(query), version) for element in value]
            return [item for item in _ret if item]
>       elif not value or value is True:
E       jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'adddress'
We don t have general-purpose rules when writing templates. Like for the source of truth, there is no need to create generic templates able to produce any BGP configuration. There is a balance to be found between readability and avoiding duplication. Templates can become scary and complex: sometimes, it s better to write a filter or a function in jerikan/jinja.py. Mastering Jinja2 is a good investment. Take time to browse through our templates as some of them show interesting features.

Checks Optionally, each configuration file can be validated by a script in the checks/ directory. Jerikan looks up the key checks in the build namespace to know which checks to run:
$ ./run-jerikan lookup edge1.ussfo03.blade-group.net build checks
- description: Juniper configuration file syntax check
  script: checks/junoser
  cache:
    input: config.txt
    output: config-set.txt
- description: check YAML data
  script: checks/data.yaml
  cache: data.yaml
In the above example, checks/junoser is executed if there is a change to the generated config.txt file. It also outputs a transformed version of the configuration file which is easier to understand when using diff. Junoser checks a Junos configuration file using Juniper s XML schema definition for Netconf.5 On error, Jerikan displays:
jerikan/build.py:127: RuntimeError
-------------- Captured syntax check with Junoser call --------------
P: checks/junoser edge2.ussfo03.blade-group.net
C: /app/jerikan
O:
E: Invalid syntax:  set system syslog archive size 10m files 10 word-readable
S: 1

Integration into GitLab CI The next step is to compile the templates using a CI. As we are using GitLab, Jerikan ships with a .gitlab-ci.yml file. When we need to make a change, we create a dedicated branch and a merge request. GitLab compiles the templates using the same environment we use on our laptops and store them as an artifact.
Merge requests in GitLab for a change
Merge request to add a new port in USSFO03. The templates were compiled successfully but approval from another team member is still required to merge.
Before approving the merge request, another team member looks at the changes in data and templates but also the differences for the generated configuration files:
Differences for generated configuration files
The change configures a port on the Juniper device, adds records to DNS, and updates NetBox with the new IP addresses (not shown).

Ansible After Jerikan has built the configuration files, Ansible takes over. It is also packaged as a Docker image to avoid the trouble to maintain the right Python virtual environment and ensure everyone is using the same versions.

Inventory Jerikan has generated an inventory file. It contains all the managed devices, the variables defined for each of them and the groups converted to Ansible groups:
ob1-n1.sk1.blade-group.net ansible_host=172.29.15.12 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
ob2-n1.sk1.blade-group.net ansible_host=172.29.15.13 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
ob1-n1.ussfo03.blade-group.net ansible_host=172.29.15.12 ansible_user=blade ansible_connection=network_cli ansible_network_os=ios
none ansible_connection=local
[oob]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[os-ios]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[model-c2960s]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
[location-sk1]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
[location-ussfo03]
ob1-n1.ussfo03.blade-group.net
[in-sync]
ob1-n1.sk1.blade-group.net
ob2-n1.sk1.blade-group.net
ob1-n1.ussfo03.blade-group.net
none
in-sync is a special group for devices which configuration should match the golden configuration. Daily and unattended, Ansible should be able to push configurations to this group. The mid-term goal is to cover all devices. none is a special device for tasks not related to a specific host. This includes synchronizing NetBox, IRR objects, and the DNS, updating the RPKI, and building the geofeed files.

Playbook We use a single playbook for all devices. It is described in the ansible/playbooks/site.yaml file. Here is a shortened version:
- hosts: adm-gateway:!done
  strategy: mitogen_linear
  roles:
    - blade.linux
    - blade.adm-gateway
    - done
- hosts: os-linux:!done
  strategy: mitogen_linear
  roles:
    - blade.linux
    - done
- hosts: os-junos:!done
  gather_facts: false
  roles:
    - blade.junos
    - done
- hosts: os-opengear:!done
  gather_facts: false
  roles:
    - blade.opengear
    - done
- hosts: none:!done
  gather_facts: false
  roles:
    - blade.none
    - done
A host executes only one of the play. For example, a Junos device executes the blade.junos role. Once a play has been executed, the device is added to the done group and the other plays are skipped. The playbook can be executed with the configuration files generated by the GitLab CI using the ./run-ansible-gitlab command. This is a wrapper around Docker and the ansible-playbook command and it accepts the same arguments. To deploy the configuration on the edge devices for the SK1 datacenter in check mode, we use:
$ ./run-ansible-gitlab playbooks/site.yaml --limit='edge:&location-sk1' --diff --check
[ ]
PLAY RECAP *************************************************************
edge1.sk1.blade-group.net  : ok=6    changed=0    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0
edge2.sk1.blade-group.net  : ok=5    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
We have some rules when writing roles:
  • --check must detect if a change is needed;
  • --diff must provide a visualization of the planned changes;
  • --check and --diff must not display anything if there is nothing to change;
  • writing a custom module tailored to our needs is a valid solution;
  • the whole device configuration is managed;6
  • secrets must be stored in Vault;
  • templates should be avoided as we have Jerikan for that; and
  • avoid duplication and reuse tasks.7
We avoid using collections from Ansible Galaxy, the exception being collections to connect and interact with vendor devices, like cisco.iosxr collection. The quality of Ansible Galaxy collections is quite random and it is an additional maintenance burden. It seems better to write roles tailored to our needs. The collections we use are in ci/ansible/ansible-galaxy.yaml. We use Mitogen to get a 10 speedup on Ansible executions on Linux hosts. We also have a few playbooks for operational purpose: upgrading the OS version, isolate an edge router, etc. We were also planning on how to add operational checks in roles: are all the BGP sessions up? They could have been used to validate a deployment and rollback if there is an issue. Currently, our playbooks are run from our laptops. To keep tabs, we are using ARA. A weekly dry-run on devices in the in-sync group also provides a dashboard on which devices we need to run Ansible on.

Configuration data and templates Jerikan ships with pre-populated data and templates matching the configuration of our USSFO03 and SK1 datacenters. They do not exist anymore but, we promise, all this was used in production back in the days!
Network architecture for Blade datacenter
The latest iteration of our network infrastructure for SK1, USSFO03, and future data centers. The production network is using BGPttH using a spine-leaf fabric. The out-of-band network is using a simple L2 design, using the spanning tree protocol, as well as a set of console servers.
Notably, you can find the configuration for:
our edge routers
Some are running on Junos, like edge2.ussfo03, the others on IOS-XR, like edge1.sk1. The implemented functionalities are similar in both cases and we could swap one for the other. It includes the BGP configuration for transits, peerings, and IX as well as the associated BGP policies. PeeringDB is queried to get the maximum number of prefixes to accept for peerings. bgpq3 and a containerized IRRd help to filter received routes. A firewall is added to protect the routing engine. Both IPv4 and IPv6 are configured.
our BGP-based fabric
BGP is used inside the datacenter8 and is extended on bare-metal hosts. The configuration is automatically derived from the device location and the port number.9 Top-of-the-rack devices are using passive BGP sessions for ports towards servers. They are also serving a provisioning network to let them boot using DHCP and PXE. They also act as a DHCP server. The design is multivendor. Some devices are running Cumulus Linux, like to1-p1.ussfo03, while some others are running Junos, like to1-p2.ussfo03.
our out-of-band fabric
We are using Cisco Catalyst 2960 switches to build an L2 out-of-band network. To provide redundancy and saving a few bucks on wiring, we build small loops and run the spanning-tree protocol. See ob1-p1.ussfo03. It is redundantly connected to our gateway servers. We also use OpenGear devices for console access. See con1-n1.ussfo03
our administrative gateways
These Linux servers have multiple purposes: SSH jump boxes, rescue connection, direct access to the out-of-band network, zero-touch provisioning of network devices,10 Internet access for management flows, centralization of the console servers using Conserver, and API for autoconfiguration of BGP sessions for bare-metal servers. They are the first servers installed in a new datacenter and are used to provision everything else. Check both the generated files and the associated Ansible tasks.

  1. Ansible does not even provide a line number when there is an error in a template. You may need to find the problem by bisecting.
    $ ansible --version
    ansible 2.10.8
    [ ]
    $ cat test.j2
    Hello   name  !
    $ ansible all -i localhost, \
    >  --connection=local \
    >  -m template \
    >  -a "src=test.j2 dest=test.txt"
    localhost   FAILED! =>  
        "changed": false,
        "msg": "AnsibleUndefinedVariable: 'name' is undefined"
     
    
  2. You may recognize the same concepts as in Hiera, the hierarchical key-value store from Puppet. At first, we were using Jerakia, a similar independent store exposing an HTTP REST interface. However, the lookup overhead is too large for our use. Jerikan implements the same functionality within a Python function.
  3. The list of available filters is mangled inside jerikan/jinja.py. This is a remain of the fact we do not maintain Jerikan as a standalone software.
  4. This is a bit confusing: we have a store() filter and a store() function. With Jinja2, filters and functions live in two different namespaces.
  5. We are using a fork with some modifications to be able to validate our configurations and exposing an HTTP service to reduce the time spent on each configuration check.
  6. There is a trend in network automation to automate a configuration subset, for example by having a playbook to create a new BGP session. We believe this is wrong: with time, your configuration will get out-of-sync with its expected state, notably hand-made changes will be left undetected.
  7. See ansible/roles/blade.linux/tasks/firewall.yaml and ansible/roles/blade.linux/tasks/interfaces.yaml. They are meant to be called when needed, using import_role.
  8. We also have some datacenters using BGP EVPN VXLAN at medium-scale using Juniper devices. As they are still in production today, we didn t include this feature but we may publish it in the future.
  9. In retrospect, this may not be a good idea unless you are pretty sure everything is uniform (number of switches for each layer, number of ports). This was not our case. We now think it is a better idea to assign a prefix to each device and write it in the source of truth.
  10. Non-linux based devices are upgraded and configured unattended. Cumulus Linux devices are automatically upgraded on install but the final configuration has to be pushed using Ansible: we didn t want to duplicate the configuration process using another tool.

3 May 2021

Russell Coker: DNS, Lots of IPs, and Postal

I decided to start work on repeating the tests for my 2006 OSDC paper on Benchmarking Mail Relays [1] and discover how the last 15 years of hardware developments have changed things. There have been software changes in that time too, but nothing that compares with going from single core 32bit systems with less than 1G of RAM and 60G IDE disks to multi-core 64bit systems with 128G of RAM and SSDs. As an aside the hardware I used in 2006 wasn t cutting edge and the hardware I m using now isn t either. In both cases it s systems I bought second hand for under $1000. Pedants can think of this as comparing 2004 and 2018 hardware. BIND I decided to make some changes to reflect the increased hardware capacity and use 2560 domains and IP addresses, which gave the following errors as well as a startup time of a minute on a system with two E5-2620 CPUs.
May  2 16:38:37 server named[7372]: listening on IPv4 interface lo, 127.0.0.1#53
May  2 16:38:37 server named[7372]: listening on IPv4 interface eno4, 10.0.2.45#53
May  2 16:38:37 server named[7372]: listening on IPv4 interface eno4, 10.0.40.1#53
May  2 16:38:37 server named[7372]: listening on IPv4 interface eno4, 10.0.40.2#53
May  2 16:38:37 server named[7372]: listening on IPv4 interface eno4, 10.0.40.3#53
[...]
May  2 16:39:33 server named[7372]: listening on IPv4 interface eno4, 10.0.47.0#53
May  2 16:39:33 server named[7372]: listening on IPv4 interface eno4, 10.0.48.0#53
May  2 16:39:33 server named[7372]: listening on IPv4 interface eno4, 10.0.49.0#53
May  2 16:39:33 server named[7372]: listening on IPv6 interface lo, ::1#53
[...]
May  2 16:39:36 server named[7372]: zone localhost/IN: loaded serial 2
May  2 16:39:36 server named[7372]: all zones loaded
May  2 16:39:36 server named[7372]: running
May  2 16:39:36 server named[7372]: socket: file descriptor exceeds limit (123273/21000)
May  2 16:39:36 server named[7372]: managed-keys-zone: Unable to fetch DNSKEY set '.': not enough free resources
May  2 16:39:36 server named[7372]: socket: file descriptor exceeds limit (123273/21000)
The first thing I noticed is that a default configuration of BIND with 2560 local IPs (when just running in the default recursive mode) takes a minute to start and needed to open over 100,000 file handles. BIND also had some errors in that configuration which led to it not accepting shutdown requests. I filed Debian bug report #987927 [2] about this. One way of dealing with the errors in this situation on Debian is to edit /etc/default/named and put in the following line to allow BIND to access to many file handles:
OPTIONS="-u bind -S 150000"
But the best thing to do for BIND when there are many IP addresses that aren t going to be used for DNS service is to put a directive like the following in the BIND configuration to specify the IP address or addresses that are used for the DNS service:
listen-on   10.0.2.45;  ;
I have just added the listen-on and listen-on-v6 directives to one of my servers with about a dozen IP addresses. While 2560 IP addresses is an unusual corner case it s not uncommon to have dozens of addresses on one system. dig When doing tests of Postfix for relaying mail I noticed that mail was being deferred with DNS problems (error was Host or domain name not found. Name service error for name=a838.example.com type=MX: Host not found, try again . I tested the DNS lookups with dig which failed with errors like the following:
dig -t mx a704.example.com
socket.c:1740: internal_send: 10.0.2.45#53: Invalid argument
socket.c:1740: internal_send: 10.0.2.45#53: Invalid argument
socket.c:1740: internal_send: 10.0.2.45#53: Invalid argument
; <
> DiG 9.16.13-Debian <
> -t mx a704.example.com
;; global options: +cmd
;; connection timed out; no servers could be reached
Here is a sample of the strace output from tracing dig:
bind(20,  sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("0.0.0.0") , 16) = 0
recvmsg(20,  msg_namelen=128 , 0)       = -1 EAGAIN (Resource temporarily 
unavailable)
write(4, "\24\0\0\0\375\377\377\377", 8) = 8
sendmsg(20,  msg_name= sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.0.2.45") , msg_
namelen=16, msg_iov=[ iov_base="86\1 
\0\1\0\0\0\0\0\1\4a704\7example\3com\0\0\17\0\1\0\0)\20\0\0\0\0
\0\0\f\0\n\0\10's\367\265\16bx\354", iov_len=57 ], msg_iovlen=1, 
msg_controllen=0, msg_flags=0 , 0) 
= -1 EINVAL (Invalid argument)
write(2, "socket.c:1740: ", 15)         = 15
write(2, "internal_send: 10.0.2.45#53: Invalid argument", 45) = 45
write(2, "\n", 1)                       = 1
futex(0x7f5a80696084, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x7f5a80696010, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f5a8069809c, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f5a80698020, FUTEX_WAKE_PRIVATE, 1) = 1
sendmsg(20,  msg_name= sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("10.0.2.45") , msg_namelen=16, msg_iov=[ iov_base="86\1 
\0\1\0\0\0\0\0\1\4a704\7example\3com\0\0\17\0\1\0\0)\20\0\0\0\0\0\0\f\0\n\0\10's\367\265\16bx\354", 
iov_len=57 ], msg_iovlen=1, msg_controllen=0, msg_flags=0 , 0) = -1 EINVAL 
(Invalid argument)
write(2, "socket.c:1740: ", 15)         = 15
write(2, "internal_send: 10.0.2.45#53: Invalid argument", 45) = 45
write(2, "\n", 1)
Ubuntu bug #1702726 claims that an insufficient ARP cache was the cause of dig problems [3]. At the time I encountered the dig problems I was seeing lots of kernel error messages neighbour: arp_cache: neighbor table overflow which I solved by putting the following in /etc/sysctl.d/mine.conf:
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1024
Making that change (and having rebooted because I didn t need to run the server overnight) didn t entirely solve the problems. I have seen some DNS errors from Postfix since then but they are less common than before. When they happened I didn t have that error from dig. At this stage I m not certain that the ARP change fixed the dig problem although it seems likely (it s always difficult to be certain that you have solved a race condition instead of made it less common or just accidentally changed something else to conceal it). But it is clearly a good thing to have a large enough ARP cache so the above change is probably the right thing for most people (with the possibility of changing the numbers according to the required scale). Also people having that dig error should probably check their kernel message log, if the ARP cache isn t the cause then some other kernel networking issue might be related. Preliminary Results With Postfix I m seeing around 24,000 messages relayed per minute with more than 60% CPU time idle. I m not sure exactly how to count idle time when there are 12 CPU cores and 24 hyper-threads as having only 1 process scheduled for each pair of hyperthreads on a core is very different to having half the CPU cores unused. I ran my script to disable hyper-threads by telling the Linux kernel to disable each processor core that has the same core ID as another, it was buggy and disabled the second CPU altogether (better than finding this out on a production server). Going from 24 hyper-threads of 2 CPUs to 6 non-HT cores of a single CPU didn t change the thoughput and the idle time went to about 30%, so I have possibly halved the CPU capacity for these tasks by disabling all hyper-threads and one entire CPU which is surprising given that I theoretically reduced the CPU power by 75%. I think my focus now has to be on hyper-threading optimisation. Since 2006 the performance has gone from ~20 messages per minute on relatively commodity hardware to 24,000 messages per minute on server equipment that is uncommon for home use but which is also within range of home desktop PCs. I think that a typical desktop PC with a similar speed CPU, 32G of RAM and SSD storage would give the same performance. Moore s Law (that transistor count doubles approximately every 2 years) is often misquoted as having performance double every 2 years. In this case more than 1024* the performance over 15 years means the performance doubling every 18 months. Probably most of that is due to SATA SSDs massively outperforming IDE hard drives but it s still impressive. Notes I ve been using example.com for test purposes for a long time, but RFC2606 specifies .test, .example, and .invalid as reserved top level domains for such things. On the next iteration I ll change my scripts to use .test. My current test setup has a KVM virtual machine running my bhm program to receive mail which is taking between 20% and 50% of a CPU core in my tests so far. While that is happening the kvm process is reported as taking between 60% and 200% of a CPU core, so kvm takes as much as 4* the CPU of the guest due to the virtual networking overhead even though I m using the virtio-net-pci driver (the most efficient form of KVM networking for emulating a regular ethernet card). I ve also seen this in production with a virtual machine running a ToR relay node. I ve fixed a bug where Postal would try to send the SMTP quit command after encountering a TCP error which would cause an infinite loop and SEGV.

1 April 2021

Paul Wise: FLOSS Activities March 2021

Focus This month I didn't have any particular focus. I just worked on issues in my info bubble.

Changes

Issues

Debugging

Review

Administration
  • Debian packages: migrate flower git repo from alioth-archive to salsa
  • Debian: restart bacula-director after PostgreSQL restart
  • Debian wiki: block spammer, clean up spam, approve accounts

Communication

Sponsors The librecaptcha/libpst/flower/marco work was sponsored by my employers. All other work was done on a volunteer basis.

Next.

Previous.